Abstract

We consider three problems in machine learning: 
• concept learning in the PAC model
• mobile robot environment learning
• learning-based approaches to protein structure prediction


In the PAC framework, we give an efficient algorithm for learning any function on k terms by 
general DNF. On the other hand, we show that in a well-studied restriction of the PAC model 
where the learner is not allowed to use a more expressive hypothesis (such as general DNF), 
learning most symmetric functions on k terms is NP-hard.


In the area of mobile robot environment learning, we introduce the problem of piecemeal learn- 
ing an unknown environment. The robot must learn a complete map of its environment, while 
satisfying the constraint that periodically it has to return to its starting position (for refueling, 
say). For environments that can be modeled as grid graphs with rectangular obstacles, we 
give two piecemeal learning algorithms in which the robot traverses a linear number of edges. 
For more general environments that can be modeled as arbitrary undirected graphs, we give a 
nearly linear algorithm. 


The final part of the thesis applies machine learning to the problem of protein structure predic- 
tion. Most approaches to predicting local 3D structures, or motifs, are tailored towards motifs 
that are already well-studied by biologists. We give a learning algorithm that is particularly 
effective in situations where large numbers of examples of the motif are not known. These are 
precisely the situations that pose significant difficulties for previously known methods. We have 
implemented our algorithm and we demonstrate its performance on the coiled coil motif.

CHAPTER 1


Introduction 


There are many reasons we want machines, or computers, to learn. A machine that can learn is 
able to use its experience to help itself in the future. Such a machine can improve its performance 
on some task after performing the task several times. This is useful for computer scientists, 
since it means we do not have to consider all the possible scenarios a machine might encounter. 
Such a machine is able to adapt to various conditions or environments, or even to changing 
environments. A machine that is able to learn can also help push science forward. It may be 
able to speed up the learning process for humans, or it may be able to discern patterns or do things
which humans are incapable of doing. For example, we may want to build a machine that can 
learn patterns that aid in medical diagnosis, or that may be able to learn how to understand 
and process speech. Or we might want to build an autonomous robot that can learn to walk 
through difficult or unexpected terrain, or that can learn a map of its environment. This robot 
could then be used to explore environments that are too dangerous for humans, such as the 
surface of other planets. 

In this thesis, we study three particular problems in machine learning. In order to study 
any machine learning problem, we must first specify the model of learning we are interested 
in. There are many different possible models, and a model should be chosen according to the 
learning application we are interested in. Once we have specified the model we are looking at, 
we can give algorithms and show results within the model. There are several things which any
“model of learning” must specify [69, 72, 44]:

1. Learner: Who is doing the learning? In this thesis, we consider the learner to be a
machine, such as a computer or a robot. Sometimes the machine is assumed to have
limited computational power (e.g., the machine is a finite automaton), but in this thesis
we assume that the machine is as powerful as a Turing machine.


2. Domain: What is being learned? One of the most well-studied types of learning is 
concept learning where the learner is trying to come up with a “rule” to separate positive 
examples from negative examples. For example, the learner may be trying to distinguish 
chairs from things which are not chairs. There are many other types of things that can 
be learned, such as an unknown environment (e.g., a new city) or an unknown technique
(e.g., how to drive).


3. Prior Knowledge: What does the learner know about the domain initially? This gen- 
erally restricts the learner’s uncertainty and/or biases and expectations about unknown 
domains. This tells what the learner knows about what is possible or probable in the 
domain. For example, the learner may know that the unknown concept is representable 
in a certain way. That is, the unknown concept might be known to be representable as a
disjunction of features, or as a graph.


4. Information Source: How is the learner informed about the domain? The learner may 
be given labeled examples. For instance, the learner may be given examples of things 
which are chairs, and examples of things which are not chairs. The learner may get 
information about a domain by asking questions of a teacher (e.g., “Is a stool a chair?”).
The learner may get information about its domain by actively experimenting with it (e.g.,
it may learn a map of a new city by walking around in it).


5. Performance Criteria: How do we know whether, or how well, the learner has learned? 
Different performance criteria include accuracy and efficiency. For accuracy, the learner 
may be evaluated by its error rate, its correctness of description, or the number of mis- 
takes it made during learning. For efficiency, the learner may be evaluated on the amount 
of computation it does and the amount of information it needs (e.g., the number of examples
it needs). In addition, the learner may be required to have a particular hypothesis
representation of an unknown concept, or it may only need to have predictive output (i.e.,
the learner does not need a representation of the unknown concept, just a way to label
new instances as either positive or negative).


Different applications require different models of machine learning. In this thesis, we con- 
sider three models of machine learning. The first part of the thesis studies a theoretical model 
of concept learning. For this model, we study learnability and give an efficient algorithm for 
learning a family of concept classes. The second part of the thesis studies mobile robot naviga- 
tion and environment learning. We introduce a model of exploration, which we call piecemeal 
learning, and give efficient algorithms for piecemeal learning unknown environments. The fi- 
nal part of the thesis applies machine learning to the problem of protein structure prediction. 
We introduce a learning technique that helps gather information on protein structures that 
biologists are interested in, but do not know much about yet. 

We now give a more detailed summary of this thesis, and outline some of the contributions
of this thesis to machine learning, mobile robot navigation, and protein structure prediction.


Concept learning in the PAC framework 


Much of the machine learning literature has been devoted to the problem of concept learning. 
We study concept learning in the Probably Approximately Correct (PAC) framework [74]. The 
object of a PAC learning algorithm is to approximately infer an unknown concept that belongs 
to some known concept class. For our purposes, it suffices to view the problem as finding 
a concept consistent with a given set of labeled examples. Figure 1.1 shows the information 
presented to the learner at the start of learning, and what the learner must produce in order 
to learn. The examples are assumed to be a “representative sample” of future examples the 
learner might see. Performance is measured by the number of examples used for learning, 
the time-complexity of the learning algorithm, and the accuracy of the learned concept. We 
consider two standard versions of the PAC model: in one, the learner is required to produce as 
output a hypothesis belonging to the same class as the concept to be learned, and in the other, 
the learner’s hypothesis can be any polynomial-time algorithm. 

For this model, we study the problem of learning the concept classes of functions on k
terms. Concept classes that can be represented by functions on k terms include k-term DNF
(disjunctive normal form formulae with at most k terms), k-term exclusive-or, and r-of-k-term
threshold functions. We give an efficient algorithm for PAC-learning any function on k terms
by general DNF. We also show that for most symmetric functions on k terms, if the learner
is required to output a hypothesis of the same concept class, then learning is NP-complete.
Thus, our results illustrate the importance of hypothesis representation. In particular, for most
concept classes of symmetric functions on k terms, learning the concept by itself is hard, but
learning it by general DNF is easy.




Figure 1.1: Concept learning with labeled examples. (a) Initially, the learner is given a set 
of labeled examples. The positive examples are denoted by +, and the negative examples are 
denoted by −. (b) The goal of the learner is to find a concept consistent with these examples.
That is, the learner wants to find a rule that differentiates the positive examples from the 
negative examples. 


Environment learning 


In the second part of this thesis, we consider an active learning model where an autonomous 
robot must learn a map of its environment (see Figure 1.2). No examples are presented to 
the robot. Instead, it learns about the environment through active experimentation: it walks 
around in the environment. We introduce the problem of piecemeal learning of an unknown 
environment. The robot’s goal is to learn a complete map of its environment, while satisfying
the constraint that it must return every so often to its starting position. The piecemeal
constraint models situations in which the robot must learn “a piece at a time.” Unlike previous
environment learning work, our work does not assume that the robot has sufficient resources to 
complete its learning task in one continuous phase; this is often an unrealistic assumption, as 
robots have limited power. After some exploration, the robot may need to recharge or refuel. 
Or, the robot may be exploring a dangerous environment, and after some time it may need to 
“cool down” or get maintenance. Or, the robot might have some other task to perform, and 
the piecemeal constraint enables “learning on the job.” 

The environment is modeled as an arbitrary, undirected graph, which is initially unknown 
to the robot. The learner’s performance is measured by the number of edges it traverses while 
exploring. For environments that can be modeled as grid graphs with rectangular obstacles, 
we give two piecemeal learning algorithms in which the robot explores every vertex and edge 
in the graph by traversing a linear number of edges. For more general environments that can 
be modeled by an undirected graph, we give a piecemeal learning algorithm in which the robot
traverses at most a nearly linear number of edges.




Figure 1.2: Environment learning. (a) Initially the learner only knows its starting location. 
(b) The learner must build a map of its environment. 


Learning-based methods for protein structure prediction 


In the last part of this thesis, we again turn to concept learning, but here the learner is given
both labeled and unlabeled examples (see Figure 1.3). Unlike the previous concept learning
model, here the labeled examples that the learner is given are not representative of the examples
that the learner will see; moreover, the learner knows that this is the case. Unlike the other work 
in this thesis, the performance measure we use here is empirical and not theoretical. Within
this model, we look at the particular application of protein structure prediction.




Figure 1.3: Concept learning with labeled and unlabeled examples. (a) The learner is given a 
set of labeled examples as well as a set of unlabeled examples. The positive examples are denoted 
by +, the negative examples are denoted by −, and the unlabeled examples are denoted by
?. (b) The learner must find a concept which partitions these examples. The unlabeled points 
within the circle are assumed positive, and the unlabeled points outside of the circle are assumed 
negative. 


The goal of this work is to use computational techniques to learn about protein structures 
or folds which biologists do not yet know much about. Current techniques for predicting local 
three-dimensional structures, or motifs, are tailored towards folds which are already well-studied 
and documented by biologists. We give a learning algorithm that is particularly effective in 
situations where this is not the case. We generalize the 2-stranded coiled coil domain to learn 3- 
stranded coiled coils, and perhaps other similar motifs. As a consequence of this work, we have 
identified many new sequences that we believe contain coiled coil and coiled-coil-like structures. 
These sequences contain regions that are not identified by the best previous computational
method, but are identified by our method. These sequences include mouse hepatitis virus,
human rotavirus, human T-cell lymphotropic virus, Human Immunodeficiency Virus (HIV) and
Simian Immunodeficiency Virus (SIV). Independently, recent laboratory work has predicted the
existence of a coiled-coil-like structure in HIV and SIV [19, 56], and our algorithm is able to
predict the regions of this structure to within a few residues. We hope that biologists will direct
their laboratory efforts towards testing other new candidate sequences which we identify.


Organization of thesis 


The thesis is organized in three self-contained chapters. In Chapter 2, we study the problem of 
learning concept classes of functions on k terms in the PAC framework. In Chapter 3, we
introduce the problem of piecemeal learning unknown environments, and give efficient algorithms
for this problem. In Chapter 4, we study the problem of learning protein motifs. Finally, in
Chapter 5, we finish with some concluding remarks.


CHAPTER 2 


Learning functions on k terms


2.1 Introduction 


Since its introduction, Valiant’s distribution-free or PAC learning framework [74] has been a 
well-studied model of concept learning. In this framework, the object of a learning algorithm is 
to approximately infer an unknown target concept that belongs to some known concept class. 
The learner is given examples chosen randomly according to a fixed but unknown distribution. 
The goal of the learner is to find (with high probability) a hypothesis that accurately predicts 
new instances as positive or negative examples of the concept. We consider here two standard 
versions of this model: in one, the learner is required to produce as output a hypothesis be- 
longing to the same class as the target concept, and in the other, the learner’s hypotheses may 
be any polynomial-time algorithm [64][50][66]. Several examples are known of concept classes 
that are hard to learn when hypotheses are restricted to belong to the same class as the target 
concept but easy to learn when they may belong to a larger class. In particular, Pitt and 
Valiant [64] showed that learning the class of k-term DNF formulas (that is, functions that can 
be represented by a disjunction of k monomials) is NP-hard if the learner is required to produce
a k-term DNF formula, but is easy if the learner may use a representation of k-CNF formulas. 

In this chapter, we show that this phenomenon occurs for a broad class of formulas. In
particular, given constant $k$ and function $f$, let $C_{k,f}$ be the class of concepts of the form
$f(T_1, \ldots, T_k)$ where $T_1, \ldots, T_k$ are monomials. So, for example, if $f$ is the OR function
then $C_{k,f}$ is the class of k-term DNF formulas. We show that for any symmetric function $f$
(that is, $f$ depends on only the number of inputs which are 1), learning the class $C_{k,f}$ by
hypothesis class $C_{k,f}$ is NP-hard except for $f \in \{\wedge, \neg\wedge, T, F\}$. The hardness result
completely characterizes the complexity of learning $C_{k,f}$ by $C_{k,f}$ for symmetric functions $f$.
For $f \in \{T, F\}$, learning $C_{k,f}$ is trivial, and for $f \in \{\wedge, \neg\wedge\}$, $C_{k,f}$ is the class of
conjunctions or disjunctions respectively, so learning $C_{k,f}$ by $C_{k,f}$ is easy by a standard
procedure for learning monomials.

On the other hand, we also present a polynomial-time algorithm that learns the class $C_k$
of all concepts $f(T_1, \ldots, T_k)$, where $f$ is any {0,1}-valued function of $k$ inputs and
$T_1, \ldots, T_k$ are monomials, using a hypothesis class of general DNF. As a consequence, this
algorithm will learn by DNF the concept classes $C_{k,f}$ for which learning $C_{k,f}$ by $C_{k,f}$ is NP-hard.

A strategy for learning the special case of k-term DNF formulas is to learn by the hypothesis 
class of k-CNF (that is, conjunctions of disjunctions of size k). Every k-term DNF can be 
written as a k-CNF (since we can “distribute out” the k-term DNF) and k-CNF can be easily 
learned by standard procedures. Suppose, however, that we wish to learn in the same manner 
another class of concepts $C_{k,f}$ (that is, other than k-term DNF) for which learning $C_{k,f}$ by $C_{k,f}$
is NP-hard. Our results and related results by Fischer and Simon [41] show that exclusive-or
(XOR) is one such function. In this case, an XOR of $k$ monomials need not be representable
as a k-CNF or as a k-DNF (for example, $x_1 x_2 \oplus x_3$ written as a DNF requires one term of size
3, and written as a CNF requires one clause of size 3). In addition, an XOR of $k$ monomials
need not have representation as a conjunction of XORs of size k. Thus, the standard strategy 
for learning k-term DNF or k-term CNF will not work for learning k-term XOR. 
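
The size-3 claim for $x_1 x_2 \oplus x_3$ can be checked by brute force. The following Python sketch is an illustration only and is not part of the original argument; the helper names (`target`, `small_terms`, `dnf_possible`, `cnf_possible`) are assumptions introduced here. It relies on the fact that a DNF for a function may contain only implicants of that function, so a DNF with terms of size at most s exists exactly when the implicants of size at most s cover every positive point.

```python
from itertools import combinations, product

N = 3  # variables x1, x2, x3 (stored at indices 0, 1, 2)

def target(v):
    """x1 x2 XOR x3 on a 0/1 tuple v."""
    return (v[0] & v[1]) ^ v[2]

def small_terms(max_size):
    """All conjunctions of at most max_size literals, as {index: required bit}."""
    for size in range(max_size + 1):
        for idxs in combinations(range(N), size):
            for bits in product((0, 1), repeat=size):
                yield dict(zip(idxs, bits))

def satisfies(v, term):
    return all(v[i] == b for i, b in term.items())

def dnf_possible(func, max_size):
    """True iff func equals some DNF whose terms all have size <= max_size."""
    points = list(product((0, 1), repeat=N))
    # An implicant is a term that is satisfied only by positive points.
    implicants = [t for t in small_terms(max_size)
                  if all(func(v) for v in points if satisfies(v, t))]
    # The small implicants must jointly cover every positive point.
    return all(any(satisfies(v, t) for t in implicants)
               for v in points if func(v))

def cnf_possible(func, max_size):
    """By De Morgan, func has a small-clause CNF iff NOT func has a small-term DNF."""
    return dnf_possible(lambda v: 1 - func(v), max_size)
```

Under these assumptions, `dnf_possible(target, 2)` and `cnf_possible(target, 2)` both come out false, while allowing size 3 makes both true.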

Instead, our algorithm is based on a different strategy. Roughly, we use the fact that a
monomial can be made false just by setting one of the literals that appears in it to 0. So,
given a concept represented by a function on $k$ unknown terms $T_1, \ldots, T_k$, if we are able to
“guess” literals that appear in $k - 1$ of the monomials and consider only examples in which
these monomials are false, we can then focus on the term remaining. Then, once we have been
able to classify the examples that satisfy only one term of $T_1, \ldots, T_k$, we can focus on those
that satisfy pairs of terms, and so on.

2.2 Notation and definitions


We will consider learning over the Boolean domain $X_n = \{0,1\}^n$. An example is an element
$\vec{v} \in \{0,1\}^n$ and a concept $c$ is a boolean function on examples. A concept class is a collection
of concepts. For a given target concept $c$, a labeled example for $c$ is a pair $(\vec{v}, c(\vec{v}))$, where $\vec{v}$
is a positive example if $c(\vec{v}) = 1$ and a negative example if $c(\vec{v}) = 0$. For convenience, we will
at times think of an example as a collection of variables or attributes $x_i$. In this case, for an
example $\vec{v}$ and variable $x \in X_n$, let $\vec{v}(x) = 1$ if the bit of $\vec{v}$ corresponding to $x$ is 1, and 0
otherwise. Also, we will use $|c|$ to denote the size of concept $c$ under some reasonable encoding.

Let $k$ be a constant. Define the concept class $C_k$ to be the set of all concepts $f(T_1, \ldots, T_k)$
where $T_1, \ldots, T_k$ are monomials (conjunctions of literals) and $f$ is any {0,1}-function on $k$
boolean inputs. For example, class $C_2$ includes the concept $x_1 x_2 \oplus x_3 x_4 x_5$, where “$\oplus$” denotes the
XOR function. For a given function $f$, let $C_{k,f}$ be those concepts in $C_k$ of the form $f(T_1, \ldots, T_k)$
for the given $f$. We say that a function $f$ is symmetric if the value of $f$ depends only on the
number of inputs that are 1. For a symmetric function $f$ and integer $i$, we let $f(i)$ denote the
value of $f$ when exactly $i$ of its inputs are 1.

We study learning in the distribution-free or Probably Approximately Correct (PAC) learning
model [74, 2]. In the PAC learning model, we assume that the learning algorithm has
available an oracle EXAMPLES(c) that, when queried, produces a labeled example $(\vec{v}, c(\vec{v}))$
according to a fixed but unknown probability distribution $D$. If $C$ and $H$ are concept classes,
we say that algorithm $A$ learns $C$ by $H$ if for some polynomial $p$, for all target concepts $c \in C$,
distributions $D$, and error parameters $\epsilon$ and $\delta$: algorithm $A$ halts in time $p(n, 1/\epsilon, 1/\delta, |c|)$ and
outputs a hypothesis $h \in H$ that with probability at least $1 - \delta$ has error at most $\epsilon$. The error
of a hypothesis $h$ is the probability that $h(\vec{v}) \neq c(\vec{v})$ when $\vec{v}$ is chosen from the distribution $D$.

For the purposes of our positive results, it will be enough to consider the following sufficient
condition for learnability [26]. An algorithm $A$ is an “Occam algorithm” for $C$ if on any sample
(collection of labeled examples) of size $m$ consistent with some $c \in C$, algorithm $A$ produces a
consistent hypothesis of size at most $|c|^{\beta} m^{\alpha}$ for constants $\alpha < 1$, $\beta \geq 1$. Blumer et al. show
that any Occam algorithm for $C$, producing hypotheses from $H$, will learn $C$ by $H$.

2.3 The learning algorithm


In this section, we present an algorithm that learns the class $C_k$ by the hypothesis class of
general DNF. To illustrate the strategy used, let us consider first the problem of learning an 
XOR of two monotone monomials. 

Suppose the target concept is $c = T_1 \oplus T_2$ for monotone monomials $T_1$ and $T_2$. We know
each positive example $\vec{v}$ satisfies one of $T_1$ or $T_2$ and fails to satisfy the other, and so has some
$v_i = 0$ for $x_i$ in exactly one of $T_1$ and $T_2$. Given a set $S$ of examples, let $S_i$, for $1 \leq i \leq n$,
be the set of those examples $\vec{v}$ for which $v_i = 0$. If a variable $x_i$ is contained in exactly one
of $\{T_1, T_2\}$, say $x_i$ is in $T_1$, then the monomial $\bar{x}_i \wedge T_2$ is satisfied by every positive example in
$S_i$ and no negative example in $S$. Therefore, we can actually find a monomial consistent with
the positive examples in this $S_i$ and the negative examples in $S$, using the standard monomial
learning procedure.

So, we can learn an XOR of two terms as follows. For each variable $x_i$, find a monomial
$M_i$ consistent with positive examples in $S_i$ and with all negative examples, if such a monomial
exists. Then, output as hypothesis the disjunction of the $M_i$’s. The hypothesis produced is
consistent with every negative example since no negative example satisfies any $M_i$. Also, since
every positive example lies in some $S_i$ for $x_i$ in exactly one of $\{T_1, T_2\}$, for each positive example
we will have found some monomial it satisfies.
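
The two-term XOR procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis's own code: examples are 0/1 tuples, a monomial is a set of `(index, bit)` literals, and the "standard monomial learning procedure" is taken to be the usual most-specific-monomial rule (keep every literal on which all the given positive examples agree). All function names are hypothetical.

```python
def most_specific_monomial(examples, n):
    """Literals (index, bit) shared by every example: the standard monomial learner."""
    return {(i, b) for i in range(n) for b in (0, 1)
            if all(v[i] == b for v in examples)}

def satisfies(v, monomial):
    return all(v[i] == b for i, b in monomial)

def learn_xor_of_two(sample, n):
    """sample: list of (v, label) pairs. Returns a DNF as a list of monomials."""
    positives = [v for v, y in sample if y == 1]
    negatives = [v for v, y in sample if y == 0]
    hypothesis = []
    for i in range(n):
        # Positive examples in S_i, i.e. those with v_i = 0.
        s_i_pos = [v for v in positives if v[i] == 0]
        if not s_i_pos:
            continue
        m_i = most_specific_monomial(s_i_pos, n)
        # Keep M_i only if it is consistent with all negative examples.
        if not any(satisfies(v, m_i) for v in negatives):
            hypothesis.append(m_i)
    return hypothesis

def predict(hypothesis, v):
    """Evaluate the DNF: 1 iff some monomial is satisfied."""
    return int(any(satisfies(v, m) for m in hypothesis))
```

Using the most specific monomial is a safe choice here: whenever some monomial is consistent with the positives in $S_i$ and all negatives, the most specific one is too, because it implies every other monomial satisfied by those positives.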

We now present an Occam algorithm based on the above strategy that learns the class $C_k$
using a hypothesis class of DNF. Without loss of generality, we may assume that the target
concept is some $f(T_1, \ldots, T_k)$ where the $T_i$ are monotone (we can think of non-monotone terms
as monotone terms over the attribute space $\{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$). The algorithm
LEARN-k-TERM takes as input a set $S$ of $m$ examples consistent with some function $f(T_1, \ldots, T_k)$ on $k$
monotone monomials and outputs a DNF of size $O(n^{k+1})$ consistent with the given examples.

The basic idea of LEARN-k-TERM is as follows. In the first iteration, the algorithm “handles”
those positive examples that satisfy none of the terms. That is, if there are any such positive
examples, the algorithm finds a set of monomials such that each of those positive examples
satisfies one of the monomials. These monomials are then added to the DNF being built. In
the second iteration, the algorithm tries to find a set of monomials for those positive examples
that satisfy exactly one of the terms. This process is continued so that at each iteration the
algorithm focuses on examples that satisfy an increasing number of terms. Thus, at each value
of $r$ in the loop, the algorithm finds terms to handle all the positive examples that fail to satisfy
exactly $r$ terms of the target concept (equivalently, that satisfy exactly $k - r$ of them). The
ordering of $r = k$ down to 0 is important to ensure that needed terms are not thrown away in
step 9. Note that in step 5, we allow the $x_{i_j}$ to be the same. This is done for purposes of
simpler analysis; the algorithm would still work if we just considered the $\binom{n}{r}$ sets of $r$
different variables.


LEARN-k-TERM(S)
 1. Let P = the positive examples in S.
 2. Let N = the negative examples in S.
 3. Initialize the DNF hypothesis h to {}.
 4. For r = k down to 0 Do
 5.     For each set of r variables $\{x_{i_1}, \ldots, x_{i_r}\}$ Do
 6.         Let M be the monomial $\bar{x}_{i_1} \cdots \bar{x}_{i_r}$.
 7.         Let U be the set of those examples $\vec{v} = (v_1, \ldots, v_n) \in P$
            such that $v_{i_1} = v_{i_2} = \cdots = v_{i_r} = 0$. That is, U is the set
            of examples in P satisfying the term M.
 8.         Let T be the monomial that is the conjunction of all $x_i$
            such that every example $\vec{v} \in U$ has $v_i = 1$. (T is the most
            specific monotone monomial satisfied by all examples in U.)
 9.         If no negative example in N satisfies the term $MT = \bar{x}_{i_1} \bar{x}_{i_2} \cdots \bar{x}_{i_r} T$
10.         Then add MT as a term to the hypothesis h
11.              let $P \leftarrow P - U$.
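
For readers who prefer executable code, here is one possible Python rendering of LEARN-k-TERM. It is a sketch under assumptions that are not in the original: examples are 0/1 tuples, a hypothesis term MT is stored as a pair `(zeros, ones)` of index sets, and iterations whose set U is empty are skipped (in the pseudocode such iterations can only add a term that no remaining positive example needs). The names `learn_k_term`, `term_satisfied`, and `predict` are hypothetical.

```python
from itertools import combinations_with_replacement

def term_satisfied(v, term):
    """A term MT is (zeros, ones): v must be 0 on zeros (M) and 1 on ones (T)."""
    zeros, ones = term
    return all(v[i] == 0 for i in zeros) and all(v[i] == 1 for i in ones)

def learn_k_term(sample, n, k):
    """sample: list of (v, label) pairs over {0,1}^n. Returns a DNF as a list of terms."""
    remaining = [v for v, y in sample if y == 1]       # P
    negatives = [v for v, y in sample if y == 0]       # N
    hypothesis = []                                    # h
    for r in range(k, -1, -1):
        # Multisets of r indices: repeated variables are allowed, as in the text.
        for chosen in combinations_with_replacement(range(n), r):
            zeros = frozenset(chosen)                  # M
            covered = [v for v in remaining
                       if all(v[i] == 0 for i in zeros)]               # U
            if not covered:
                continue  # assumption of this sketch: skip empty U
            ones = frozenset(i for i in range(n)
                             if all(v[i] == 1 for v in covered))       # T
            term = (zeros, ones)
            if not any(term_satisfied(v, term) for v in negatives):
                hypothesis.append(term)
                covered_set = set(covered)
                remaining = [v for v in remaining if v not in covered_set]
    return hypothesis

def predict(hypothesis, v):
    """Evaluate the DNF hypothesis on example v."""
    return int(any(term_satisfied(v, t) for t in hypothesis))
```

The loop enumerates on the order of $n^k$ index multisets and scans the sample for each one, in line with the running time stated for the algorithm.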


Algorithm LEARN-k-TERM clearly runs in time polynomial in $m$ and $n^k$, so we just need to
prove the following theorem.

Theorem 1 Algorithm LEARN-k-TERM, on $m$ examples consistent with some function $f$ of $k$
monotone monomials over $\{0,1\}^n$, produces a consistent DNF hypothesis of size $O(n^{k+1})$.


Proof: First notice the following facts. The DNF $h$ produced by algorithm LEARN-k-TERM
has at most $n^k + n^{k-1} + \cdots + n = O(n^k)$ terms of size $O(n)$, so the size of the hypothesis is at
most $O(n^{k+1})$. Also, the hypothesis $h$ is consistent with the set $N$ of negative examples, since in
step 9 any term that some negative example satisfies will never be included in the DNF. Thus
all we need to do is prove that for every positive example $\vec{v} \in P$, there is some term added to
$h$ which is satisfied by $\vec{v}$.

Let $f(T_1, \ldots, T_k)$ be the target concept where $T_1, \ldots, T_k$ are monotone monomials. Let $S_j$
for $j \in \{0, \ldots, k\}$ be the set of those positive examples seen that satisfy exactly $j$ of $T_1, \ldots, T_k$
(if $f$ is the XOR function, for instance, then the sets $S_j$ for even values of $j$ are all empty). We
will argue by induction on the index $j$; in particular we will argue that after the iteration of the
loop of LEARN-k-TERM in which $r = k - j$, all positive examples $\vec{v} \in S_j$ have been “captured” by
(that is, they satisfy) some term in $h$.


$j = 0$, $r = k$: Let $\vec{v}$ be a positive example that satisfies none of $T_1, \ldots, T_k$. If such an example
exists, then any other example satisfying none of $T_1, \ldots, T_k$ must also be a positive example.
There must be some collection of variables $x_{i_1} \in T_1, \ldots, x_{i_k} \in T_k$ (not necessarily
all different) such that $v_{i_1} = v_{i_2} = \cdots = v_{i_k} = 0$, or otherwise $\vec{v}$ would satisfy some term.

Consider the iteration in which the monomial $M$ is $\bar{x}_{i_1} \cdots \bar{x}_{i_k}$. Example $\vec{v}$ satisfies $M$
and so is put into $U$ in step 7. Any other example satisfying $M$ cannot satisfy any of
$T_1, \ldots, T_k$ (by definition of $x_{i_1}, \ldots, x_{i_k}$) and therefore must be positive. So, a term $MT$
satisfied by $\vec{v}$ will be added to $h$ in step 9.

$j > 0$, $r = k - j$: Let $\vec{v}$ be a positive example that satisfies exactly $j$ of the terms $T_1, \ldots, T_k$; for
convenience, assume $\vec{v}$ satisfies terms $T_{r+1}, \ldots, T_k$. Any other example satisfying exactly
those terms and no others must also be positive. Let $x_{i_1} \in T_1, \ldots, x_{i_r} \in T_r$ be a collection
of not necessarily distinct variables such that $v_{i_1} = \cdots = v_{i_r} = 0$.

At the iteration in which the monomial $M$ is $\bar{x}_{i_1} \cdots \bar{x}_{i_r}$, example $\vec{v}$ is put into set $U$ in step
7 and the term $T$ created is satisfied by $\vec{v}$. In fact, $T$ also has in it all variables contained
in the terms $T_{r+1}, \ldots, T_k$. The reason is as follows:

    Suppose $x_i$ is contained in one of $T_{r+1}, \ldots, T_k$ but not in $T$. Then, there must
    exist some positive example $\vec{w} \in U$ such that $w_i = 0$. So, example $\vec{w}$ fails to
    satisfy at least one of $T_{r+1}, \ldots, T_k$ in addition to not satisfying any of $T_1, \ldots, T_r$.
    But, this means that $\vec{w}$ satisfies fewer than $j$ terms and so must already have
    been removed from $P$ in an earlier iteration by our inductive hypothesis. (Note
    that it is for this reason that algorithm LEARN-k-TERM begins with $r = k$ and
    works down to $r = 0$.)

So, any example satisfying $MT$ must satisfy all of $T_{r+1}, \ldots, T_k$ (since it satisfies $T$) and
none of $T_1, \ldots, T_r$ (since it satisfies $M$) and therefore must be positive. Thus, term $MT$
will be added to $h$ in step 9.


So, we have shown that algorithm LEARN-k-TERM, on any size input consistent with some
function $f$ of $k$ monotone monomials over $\{0,1\}^n$, produces a consistent hypothesis of size
$O(n^{k+1})$ in time polynomial in $m$ and $n^k$. □

Corollary 1 The concept class $C_k$ is learnable by DNF in the distribution-free model.


In fact, if we assume without loss of generality that the target concept $c = f(T_1, \ldots, T_k)$
has the property that $f(00 \cdots 0) = 0$ (otherwise we will learn $\bar{c}$), then we can start algorithm
LEARN-k-TERM at $r = k - 1$ and produce a DNF of only $O(n^{k-1})$ terms instead of one of $O(n^k)$
terms. So, for example, we can learn a k-term DNF with a DNF hypothesis of $O(n^{k-1})$ terms
each of size $O(n)$. This differs from the standard procedure of learning k-term DNF, which
gives a k-CNF of $O(n^k)$ clauses of size $k = O(1)$. Moreover, if we know that $f$ outputs 0 when
only a few of its inputs are 1, then we can produce a hypothesis of smaller size. For example,
if $f$ is the majority function, then we can start LEARN-k-TERM with $r = k/2$ and get a DNF of
only $O(n^{k/2})$ terms.


2.3.1 Decision lists 


An alternative way to learn $C_k$ is to learn by the class of k-decision lists (k-DLs).¹ In fact, the
proof for Algorithm LEARN-k-TERM can be modified to show any concept in $C_k$ can be written
as a k-decision list. In particular, let $c = f(T_1, \ldots, T_k)$ be some concept in $C_k$. The decision list
will consist of rules of the form “if $M_i$ then $b_i$,” where each $M_i$ will correspond to one of
the monomials $M$ in algorithm LEARN-k-TERM.


¹A k-decision list is a function of the form: “if $M_1$ then $b_1$, else if $M_2$ then $b_2$, else ... else if $M_m$
then $b_m$ else $b_{m+1}$,” where the $M_i$ are monomials of size at most $k$ and the $b_i$ are each either 0 or 1.




Let $b_0$ be the value of $c(\vec{a})$ when $\vec{a}$ satisfies none of $T_1, \ldots, T_k$. Put on the top of the decision
list all rules of the form “if $\bar{x}_{i_1} \bar{x}_{i_2} \cdots \bar{x}_{i_k}$ then $b_0$,” where $x_{i_1} \in T_1, \ldots, x_{i_k} \in T_k$. Let us say that
a set of rules “captures” an example if the example satisfies the if-portion of one of them.
Thus, we have now captured all examples that satisfy none of the $T_i$ (and have classified them
correctly).

Inductively suppose we have created rules that capture (and correctly classify) all examples
satisfying $j - 1$ or fewer of the $k$ terms. Append onto the bottom of the decision list the
following rules. For each subset $\{T_{i_1}, \ldots, T_{i_{k-j}}\} \subseteq \{T_1, \ldots, T_k\}$ such that all examples which
satisfy exactly the $j$ terms remaining are positive, add all rules of the form: “if $\bar{x}_{j_1} \bar{x}_{j_2} \cdots \bar{x}_{j_{k-j}}$
then 1,” where $x_{j_1} \in T_{i_1}, \ldots, x_{j_{k-j}} \in T_{i_{k-j}}$. For each subset $\{T_{i_1}, \ldots, T_{i_{k-j}}\} \subseteq \{T_1, \ldots, T_k\}$ such
that all examples satisfying exactly the $j$ terms remaining are negative, add all rules of the
form: “if $\bar{x}_{j_1} \bar{x}_{j_2} \cdots \bar{x}_{j_{k-j}}$ then 0,” where $x_{j_1} \in T_{i_1}, \ldots, x_{j_{k-j}} \in T_{i_{k-j}}$.

Finally, the default case of the decision list is the rule “else $b$,” where $b$ is the classification
of examples satisfying all the terms $T_i$. It is clear from the above arguments that this k-decision
list is logically equivalent to the k-term function.
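
The construction can be made concrete with a short Python sketch. It assumes the terms are known, monotone, and non-empty, each given as a tuple of variable indices, and that `f` maps a k-tuple of term values to 0 or 1; `build_decision_list` and `evaluate` are names introduced here for illustration. Positive and negative rules of the same level are emitted together, which does not affect correctness.

```python
from itertools import combinations, product

def build_decision_list(terms, f):
    """Return (rules, default). A rule (zeros, bit) reads:
    'if v[i] == 0 for every i in zeros then output bit'."""
    k = len(terms)
    rules = []
    for j in range(k):                       # level j: examples satisfying exactly j terms
        for falsified in combinations(range(k), k - j):
            # Term values for an example that fails exactly the chosen k - j terms.
            pattern = tuple(0 if t in falsified else 1 for t in range(k))
            bit = f(pattern)
            # One variable from each falsified term, set to 0, makes that term false.
            for picks in product(*(terms[t] for t in falsified)):
                rules.append((frozenset(picks), bit))
    default = f((1,) * k)                    # examples satisfying all k terms
    return rules, default

def evaluate(rules, default, v):
    """Output of the decision list on example v: first matching rule wins."""
    for zeros, bit in rules:
        if all(v[i] == 0 for i in zeros):
            return bit
    return default
```

Each rule tests at most k variables, so the result is a k-decision list in the sense of the footnote; an example that satisfies exactly j terms cannot match any rule above level j, because such a rule would force more than k − j terms to be false.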

The mistake-bound model is a model of learning more stringent than the PAC model; here, 
unlabeled examples are presented to the learner in an arbitrary order, and after each one the 
learner must predict its classification before being told the correct value. The learner is judged 
by the total number of mistakes it makes in such a sequence. Using the halving algorithm [54],
$k$-decision lists can be learned in the mistake-bound model with $O(n^k)$ mistakes. Thus we have
the following theorem:


Theorem 2 All functions on $k$ terms can be learned in the mistake-bound model with $O(n^k)$
mistakes, using a representation of $k$-decision lists.


In fact, we can learn k-term functions in an “attribute-efficient” sense, where the number 
of mistakes is polynomial in the number of relevant variables (variables that appear in some
term $T_i$) and is only logarithmic in the number of irrelevant variables. This uses a result of
Littlestone [54] as follows. 

An alternation in a decision list is a pair of adjacent rules such that the boolean classification
values for the rules differ. By appropriately ordering the rules in the decision list construction
above (listing the “negative rules” before the “positive rules” on alternate $j$ values) one can
see that for any $k$-term function there is a logically equivalent $k$-decision list with at most $k$
alternations. Such a decision list can be thought of as a function in the form:


if ($M_{1,1}$ OR $M_{1,2}$ OR ... OR $M_{1,m_1}$) then $b_1$, else if ($M_{2,1}$ OR $M_{2,2}$ OR ... OR
$M_{2,m_2}$) then $b_2$, else ... else if ($M_{k-1,1}$ OR $M_{k-1,2}$ OR ... OR $M_{k-1,m_{k-1}}$) then $b_{k-1}$,
else $b_k$,

where $b_i = 1 - b_{i-1}$.
Decision lists with small numbers of alternations can be written as linear threshold functions
over the monomials $M_{i,j}$, with not too large integral weights. For instance, if $b_{k-1} = 1$, $k$ is odd,
and $m$ is the sum of the $m_i$, the above decision list can be written as:

$$
\begin{aligned}
&(M_{k-1,1} + \cdots + M_{k-1,m_{k-1}}) - m\,(M_{k-2,1} + \cdots + M_{k-2,m_{k-2}}) \\
&\quad + m^2\,(M_{k-3,1} + \cdots + M_{k-3,m_{k-3}}) - \cdots - m^{k-2}\,(M_{1,1} + \cdots + M_{1,m_1}) > 0.
\end{aligned}
$$


If only $r$ variables are relevant to the $k$-term function, then the number of rules $m$ is at
most $r^k$. Therefore, the maximum weight in the threshold function is at most $r^{k^2}$.
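The equivalence between an alternating decision list and a weighted threshold over its monomials can be checked by brute force. The sketch below is illustrative only: it treats each monomial as an independent 0/1 indicator, assumes the default label is 0, and the names `decision_list_value` and `threshold_value` are hypothetical. Group $i$ (counted from the top) receives weight $\pm m^{G-i}$, where $G$ is the number of groups and $m$ the total number of monomials.

```python
from itertools import product

def decision_list_value(groups_sat, labels, default=0):
    # groups_sat[i] is a tuple of 0/1 flags: which monomials of group i are satisfied.
    for sat, b in zip(groups_sat, labels):
        if any(sat):
            return b
    return default

def threshold_value(groups_sat, labels, m):
    # The last group gets weight 1, the one above it m, then m^2, and so on.
    # Groups labelled 1 contribute positively, groups labelled 0 negatively.
    G = len(labels)
    total = 0
    for idx, (sat, b) in enumerate(zip(groups_sat, labels)):
        sign = 1 if b == 1 else -1
        total += sign * (m ** (G - 1 - idx)) * sum(sat)
    return 1 if total > 0 else 0

def agree_everywhere(sizes, labels):
    m = sum(sizes)
    for flat in product([0, 1], repeat=m):
        groups, pos = [], 0
        for size in sizes:
            groups.append(flat[pos:pos + size])
            pos += size
        if decision_list_value(groups, labels) != threshold_value(groups, labels, m):
            return False
    return True
```

The weights work because a single satisfied monomial in the topmost active group outweighs every monomial in all lower groups combined.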

Littlestone [54] gives an algorithm that can be used to learn such a function, where the
number of mistakes is at most $O((m r^{k^2})^2 \log(n^k)) = O(k r^{2k^2+2k} \log n)$. Thus, if the number $r$
of relevant variables is small, this can yield a savings in the number of mistakes made. We
have the following theorem:

Theorem 3 Any function on $k$ terms can be learned with $O(k r^{2k^2+2k} \log n)$ mistakes, where $r$
is the number of relevant variables.
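As a check on the exponents, and assuming only the bounds stated above ($m \le r^k$ and maximum weight at most $r^{k^2}$), the simplification behind the bound can be spelled out:

```latex
\begin{align*}
(m\, r^{k^2})^2 \log(n^k)
  &\le (r^{k} \cdot r^{k^2})^2 \cdot k \log n \\
  &= r^{2k^2 + 2k} \cdot k \log n \\
  &= k\, r^{2k^2 + 2k} \log n .
\end{align*}
```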

2.4 Hardness results 


In this section, we show that learning the class $C_{k,f}$ often requires allowing the learning algorithm
a more expressive hypothesis class than $C_{k,f}$. In the previous section, we gave an algorithm that
learns the concept class of functions on $k$ terms using the hypothesis class of general DNF. On
the other hand, we now show that when learning the class $C_{k,f}$, if the algorithm must produce
a hypothesis from the class $C_{k,f}$, the problem can become NP-hard. In particular, we show
that for any symmetric function $f$, learning the class $C_{k,f}$ by hypothesis class $C_{k,f}$ is NP-hard
except for $f \in \{\wedge, \neg\wedge, T, F\}$. The hardness result completely characterizes the complexity of
learning $C_{k,f}$ by $C_{k,f}$ for symmetric functions $f$. For $f \in \{T, F\}$, learning $C_{k,f}$ is trivial, and for
$f \in \{\wedge, \neg\wedge\}$, $C_{k,f}$ is the class of conjunctions or disjunctions respectively, so learning $C_{k,f}$ by
$C_{k,f}$ is easy by a standard procedure. We show the following:


Theorem 4 For any symmetric function $f$ on $k$ inputs except for $f \in \{\wedge, \neg\wedge, T, F\}$, learning
the class $C_{k,f}$ by $C_{k,f}$ is NP-hard.


This theorem extends the work of Pitt and Valiant [64], which shows that learning the class
of $k$-term DNF formulas is NP-hard if the learner is required to produce a $k$-term DNF formula.
Before giving the proof of Theorem 4, we first provide some intuition. For $k \ge 3$, the proof
of Pitt and Valiant is essentially a reduction from graph $k$-colorability.² Their reduction is as
follows. Given the graph, they create a variable $x_i$ for each vertex $v_i \in V$. They then create
one positive example for each vertex so that the example corresponding to vertex $i$ has bit
$i$ set to 0 and all other bits set to 1. They also create one negative example for each edge
such that the example corresponding to edge $(i,j)$ has bits $i$ and $j$ set to 0 and the other
bits set to 1. They then show that the set of examples is consistent with a disjunction of $k$
terms if and only if $G$ is $k$-colorable. Their proof does not work for more general symmetric
functions $f$ of $k$ terms. In particular, when $f$ is a symmetric function other than OR (e.g.,
when the concept class is 4-term exclusive-or formulas), using their reduction it is possible to
find a formula $f(T_1, T_2, \ldots, T_k)$ that correctly classifies all positive and negative examples, but
the corresponding coloring is invalid. The basic problem is that unlike the case of disjunction,
for arbitrary $f$, as the number of inputs that are 1 increases, the value of $f$ can switch back
and forth between 1 and 0. To solve this problem, we introduce enough variables and examples
for each edge such that $x_i$ and $x_j$ are forced to occur in different terms. We can use this
technique to reduce graph $k$-colorability to learning any symmetric function on $k$ terms (except
$\wedge, \neg\wedge, T, F$).

²The graph $k$-colorability problem is: given a graph $G = (V,E)$ and a positive integer $k$, does there exist a
function $f : V \to \{1,2,\ldots,k\}$ such that $f(u) \ne f(v)$ whenever $(u,v) \in E$? That is, using at most $k$ colors, is
it possible to assign a color to each vertex in the graph such that for any edge, its vertices are given different
colors?
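The Pitt and Valiant reduction described above is easy to experiment with. The sketch below is an illustration under assumptions: vertices are numbered from 0, colors from 0, and the function names are hypothetical. It builds the examples from a graph and, given a coloring, forms the usual candidate $k$-term DNF in which term $c$ contains $x_v$ for every vertex $v$ not colored $c$.

```python
def pitt_valiant_examples(n, edges):
    # One positive example per vertex: only that vertex's bit is 0.
    positives = [tuple(0 if b == v else 1 for b in range(n)) for v in range(n)]
    # One negative example per edge: only the two endpoint bits are 0.
    negatives = [tuple(0 if b in (u, v) else 1 for b in range(n)) for u, v in edges]
    return positives, negatives

def dnf_from_coloring(coloring, k):
    # Term c is the monotone conjunction of x_v over all vertices v NOT colored c.
    return [[v for v, c in enumerate(coloring) if c != color] for color in range(k)]

def eval_dnf(terms, example):
    return any(all(example[v] == 1 for v in term) for term in terms)
```

With a proper coloring, every positive example satisfies the term of its own color and every negative example falsifies all terms; with an improper coloring, the example of a monochromatic edge is misclassified.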

To show Theorem 4, we first consider the concept class $C^{\mathrm{mon}}_{k,f} = \{f(T_1,\ldots,T_k)\}$ where
$T_1,\ldots,T_k$ are monotone monomials, and show that learning $C^{\mathrm{mon}}_{k,f}$ by $C^{\mathrm{mon}}_{k,f}$ is NP-hard. We
then give an extension of the argument that shows that learning $C^{\mathrm{mon}}_{k,f}$ by $C_{k,f}$ is NP-hard. This
implies Theorem 4.


Theorem 5 For any symmetric function $f$ on $k$ inputs except $f \in \{\wedge, \neg\wedge, T, F\}$, learning the
class $C^{\mathrm{mon}}_{k,f}$ by $C^{\mathrm{mon}}_{k,f}$ is NP-hard.


Proof: First note that if $k = 2$ then the only functions $f$ with $f \notin \{\wedge, \neg\wedge, T, F\}$ are the
functions $\{\vee, \neg\vee, \oplus, \neg\oplus\}$. The proof of [64] for 2-term DNF can be applied directly for these
cases; so, we assume that $k \ge 3$. Without loss of generality, we assume that $f(k - 1) = 0$; that
is, $f$ outputs 0 when exactly $k - 1$ of its inputs are 1. Otherwise, we show that learning $C^{\mathrm{mon}}_{k,f'}$
by $C^{\mathrm{mon}}_{k,f'}$ for $f' = \neg f$ is NP-hard and the result follows.

The proof is a reduction from graph $k$-colorability. Given a graph $G = (V, E)$, we create
labeled examples over $n = |V| + (k - 2)|E|$ variables such that there exists $c \in C^{\mathrm{mon}}_{k,f}$ consistent
with these examples if and only if there is a $k$-coloring of the graph. We assume that $G$ contains
no isolated vertices since such vertices do not affect the coloring of the graph.


We denote the $n$ variables as follows. There is one variable $x_i$ for each vertex $i \in V$, and
$k - 2$ variables $w^1_{i,j}, w^2_{i,j}, \ldots, w^{k-2}_{i,j}$ for each edge $(i,j) \in E$. Thus, for each edge $(i,j) \in E$,
we have a set $W_{i,j}$ of $k$ associated variables $\{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{k-2}_{i,j}\}$. We add the $w_{i,j}$’s so
that ultimately any hypothesis consistent with the examples we define must contain $x_i$ and $x_j$
in different terms if $(i,j) \in E$. For convenience, we use the following notation to denote an
example that consists of 1’s in all bits except those specified by a set of variables $W$.

• For $W$ a collection of variables, let $g(W)$ be the example $\vec{v}$ such that $\vec{v}(x) = 0$ for $x \in W$
and $\vec{v}(x) = 1$ for $x \notin W$. Recall that $\vec{v}(x)$ is the bit of $\vec{v}$ corresponding to variable $x$.


For $l \in \{1,\ldots,k\}$ and $(i,j) \in E$, let $S^l_{i,j} = \{g(W) : W \subseteq W_{i,j}, |W| = l\}$. That is, $S^l_{i,j}$ is the
set of examples $\vec{v} = g(W)$ for $W$ a subset of size $l$ of the set $\{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{k-2}_{i,j}\}$. We
now define $k$ sets of examples as follows:

$$
S^1 = \bigcup_{(i,j) \in E} S^1_{i,j}, \qquad
S^2 = \bigcup_{(i,j) \in E} S^2_{i,j}, \qquad \ldots, \qquad
S^k = \bigcup_{(i,j) \in E} S^k_{i,j},
$$

such that $\vec{v} \in S^l$, $1 \le l \le k$, is a positive example if and only if $f(k - l) = 1$. That is, for each
edge $(i,j) \in E$, each $S^l$ contains $\binom{k}{l}$ examples corresponding to that edge. Each $\vec{v} \in S^l$ has
exactly $l$ bits set to 0, where the $l$ variables corresponding to these bits are chosen from some
set $W_{i,j}$. If $f$ is true when exactly $k - l$ terms are true (i.e., $f(k - l) = 1$), then we label all
vectors in $S^l$ as positive examples; otherwise we label them as negative examples. For example,
if $f$ is the XOR function and $k$ is even, then all examples in $S^1, S^3, \ldots$ are labeled as positive
and those in $S^2, S^4, \ldots$ are labeled as negative.

We now show that there exist monotone terms $T_1, T_2, \ldots, T_k$ such that $f(T_1, T_2, \ldots, T_k)$ is
consistent with these examples if and only if there is a $k$-coloring of the graph $G$.


(⇐) Given a $k$-coloring of the graph, for each vertex $i$ which is colored $l$, place $x_i$ in term
$T_l$. Then for each edge $(i,j)$, variables $x_i$ and $x_j$ appear in different terms. Now arbitrarily
place the remaining $k-2$ variables associated with this edge (the $w_{i,j}$’s) into the remaining $k-2$
terms such that each term receives exactly one variable. Thus for each edge $(i,j)$, each of the
associated variables $\{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{k-2}_{i,j}\}$ occurs in a different term. So for any example
in $S^l$, exactly $l$ terms are false and $k - l$ terms are true. Since the examples in $S^l$ are positive
exactly when $f(k - l) = 1$, the concept $f(T_1, T_2, \ldots, T_k)$ classifies all examples correctly.
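The example sets $S^l$ and the coloring-to-terms direction can be sketched in a few lines. This is an illustrative sketch under assumptions: vertices and colors are numbered from 0, the $k-2$ edge variables of the $e$-th edge are given consecutive indices after the vertex variables, the symmetric function is passed as `f_sym(number_of_true_terms)`, and all function names are hypothetical.

```python
from itertools import combinations

def build_examples(num_vertices, edges, k, f_sym):
    """Labeled examples over n = |V| + (k-2)|E| variables, one batch per edge."""
    n = num_vertices + (k - 2) * len(edges)
    examples = []
    for e_idx, (u, v) in enumerate(edges):
        base = num_vertices + (k - 2) * e_idx
        edge_vars = [u, v] + list(range(base, base + k - 2))  # the set W_{u,v}
        for l in range(1, k + 1):
            for zeros in combinations(edge_vars, l):
                bits = tuple(0 if i in zeros else 1 for i in range(n))
                examples.append((bits, f_sym(k - l)))  # positive iff f(k-l) = 1
    return examples

def terms_from_coloring(coloring, edges, k):
    """Monotone terms built from a proper k-coloring (assumed valid)."""
    num_vertices = len(coloring)
    terms = [set() for _ in range(k)]
    for v, c in enumerate(coloring):
        terms[c].add(v)
    for e_idx, (u, v) in enumerate(edges):
        base = num_vertices + (k - 2) * e_idx
        free = [c for c in range(k) if c not in (coloring[u], coloring[v])]
        for offset, c in enumerate(free):  # one edge variable per remaining term
            terms[c].add(base + offset)
    return terms

def classify(terms, f_sym, bits):
    true_terms = sum(all(bits[i] == 1 for i in term) for term in terms)
    return f_sym(true_terms)
```

On a small properly colored graph, every generated example is classified according to its label, since an example with $l$ zero bits falsifies exactly $l$ of the $k$ terms.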


(⇒) Suppose we have $T_1, T_2, \ldots, T_k$ such that concept $c = f(T_1, T_2, \ldots, T_k)$ is consistent with
all the examples. Now color the vertices by the function $\chi : V \to \{1,2,\ldots,k\}$ defined by $\chi(i) =
\min\{j : \text{variable } x_i \text{ occurs in term } T_j\}$. Lemma 1 guarantees we have a well defined function,
and Lemma 2 gives us a valid coloring. ∎


Lemma 1 Each variable $x_i$ occurs in some term.




Proof: Suppose that some $x_i$ does not occur in any term. Let $q = \min\{l : f(k - l) = 1 \text{ and } l >
0\}$. That is, $q$ is the smallest positive number of terms that can be false such that concept $c$ is
true. Note that $q$ is the least index such that $c(\vec{v}) = 1$ for $\vec{v} \in S^q$. We know that $q$ exists for
$f \notin \{\text{AND}, \text{FALSE}\}$.

Pick $j$ such that $(i,j) \in E$ (since we assumed that the graph has no isolated vertices, we know some
such $j$ exists). Now consider the positive example $\vec{v} = g(\{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q-2}_{i,j}\})$. If $x_i$ does
not occur in any term, then $\vec{u} = g(\{x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q-2}_{i,j}\})$ satisfies the same number of terms
as $\vec{v}$, and thus $c(\vec{u}) = c(\vec{v}) = 1$. But $\vec{u}$ belongs to $S^{q-1}$, and we know all examples in $S^{q-1}$ are
negative examples by our definition of $q$ ($S^q$ is our first set of positive examples). Contradiction.


Lemma 2 If $(i,j) \in E$ then $x_i$ and $x_j$ never occur in the same term.


Proof: Suppose that for $(i,j) \in E$, variables $x_i$ and $x_j$ occur in the same term. Again, let
us look at vectors in $S^q$ where $q = \min\{l : f(k - l) = 1 \text{ and } l > 0\}$. In particular, consider
the positive example $\vec{v} = g(\{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q-2}_{i,j}\})$. By Lemma 3, we know that exactly $q$
terms of $c$ are not satisfied by $\vec{v}$. Then we know that each of these $q$ terms must contain at least
one variable of $\{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q-2}_{i,j}\}$. If $x_i$ and $x_j$ occur in the same term, then we know
that some variable $z \in \{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q-2}_{i,j}\}$ occurs in at least two terms. Let $r$ be the
number of terms that variable $z$ appears in. We build a set $S$ of at most $q - r + 1$ variables such
that $\vec{u} = g(S)$ also makes $q$ terms false. Initially let $S = \{z\}$. Then for each of the remaining
$q - r$ terms not satisfied by $\vec{v}$, place into $S$ some variable from $\{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q-2}_{i,j}\}$ which
appears in that term. Now consider example $\vec{u}$. The terms not satisfied by $\vec{u}$ are exactly those
not satisfied by $\vec{v}$, so $c(\vec{u}) = c(\vec{v}) = 1$. Moreover, since $S \subset \{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q-2}_{i,j}\} \subseteq W_{i,j}$,
example $\vec{u}$ must lie in some set $S^l$ where $l < q$. But $S^q$ is our first set (the set of least index)
of positive examples, so $\vec{u}$ must be negative. Contradiction. ∎


Lemma 3 Exactly $q$ terms of $c$ are not satisfied by $\vec{v}$.


Proof: Suppose not. That is, suppose $r \ne q$ terms of $c$ are not satisfied by $\vec{v}$. Since $\vec{v}$ is a
positive example, $f(k - r) = 1$ and by definition of $q$ we have $r > q$. There are now two cases:

Case 1: $f(k - l) = 1$ for all $l \in \{q, q+1, \ldots, r\}$.

By definition of $q$, for any set $S \subset \{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q-2}_{i,j}\}$ of size $q - 1$, $c(g(S)) = 0$. This
implies that each $\vec{u} = g(S)$ satisfies at least $r - q + 1$ more terms of $\{T_1,\ldots,T_k\}$ than does
$\vec{v}$. But this requires each variable in $\{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q-2}_{i,j}\}$ to appear without any other
variable from this set in $r - q + 1$ terms. So there must exist $q(r - q + 1)$ terms not satisfied by
$\vec{v}$. Since $r > q$ and $q \ne 1$ (we know $f(k - q) = 1$ but $f(k - 1) = 0$), we have:

$$
\begin{aligned}
r(q - 1) &> q(q - 1) \\
rq - r &> q^2 - q \\
q(r - q + 1) &> r
\end{aligned}
$$

Thus, more than $r$ terms are not satisfied by $\vec{v}$. Contradiction.


Case 2: $f(k - l) = 0$ for some $l \in \{q+1, \ldots, r-1\}$.

Consider the sequence of examples:

$$
\begin{aligned}
\vec{v}_{q_1} &= g(\{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q_1-2}_{i,j}\}), \\
\vec{v}_{q_2} &= g(\{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q_2-2}_{i,j}\}), \\
\vec{v}_{q_3} &= g(\{x_i, x_j, w^1_{i,j}, w^2_{i,j}, \ldots, w^{q_3-2}_{i,j}\}), \\
&\;\;\vdots
\end{aligned}
$$

We assign values to $q_i$, $r_i$, and $l_i$ which maintain the following invariants: $q_i < l_i < r_i$ and
$f(k - q_i) = f(k - r_i)$ and $f(k - q_i) \ne f(k - l_i)$. Initially let $q_1 = q$, $r_1 = r$, and $l_1 = l$.
Initially, positive example $\vec{v}_{q_1}$ fails to satisfy $r_1$ terms and there exists $l_1$ between $q_1$ and $r_1$ with
$f(k - l_1) = 0$. Thus negative example $\vec{v}_{l_1}$ must fail to satisfy some $r_2 > r_1$ terms. Now let
$q_2 = l_1$ and $l_2 = r_1$, and so we have $f(k - q_2) = f(k - r_2) = 0$, $f(k - l_2) = 1$, and $q_2 < l_2 < r_2$.
Thus we know that positive example $\vec{v}_{l_2}$ must fail to satisfy some $r_3 > r_2$ terms. Letting $q_3 = l_2$ and
$l_3 = r_2$, and continuing in this fashion, we find an increasing sequence $q_1, q_2, q_3, \ldots$, such that
each example $\vec{v}_{q_i}$ fails to satisfy $r_i > q_i$ terms. At $q_i = k$, we have a contradiction. ∎
We have now finished proving Theorem 5. We now extend the proof to the general case in
which the terms $T_1,\ldots,T_k$ may be non-monotone.


Proof of Theorem 4: We show that learning $C^{\mathrm{mon}}_{k,f}$ by $C_{k,f}$ is NP-hard. This implies the theorem.
Given a graph $G = (V, E)$ we create a new graph $G'$ consisting of $k + 1$ copies $G_1,\ldots,G_{k+1}$
of $G$. Clearly $G'$ is $k$-colorable if and only if $G$ is. We define examples for $G'$ in the same way as in the
proof of Theorem 5. We must now show that there exist (non-monotone) terms $T_1, T_2, \ldots, T_k$
such that $f(T_1, T_2, \ldots, T_k)$ is consistent with the examples if and only if there is a $k$-coloring
of the graph $G$. Given a $k$-coloring of the graph $G$, we can easily find a $k$-coloring of graph
$G'$. From this coloring, we can find $k$ terms such that $f(T_1, T_2, \ldots, T_k)$ is consistent with the
examples, using the same method as in the proof of Theorem 5. For the other direction, we
must show that if there are non-monotone terms $T_1,\ldots,T_k$ such that $f(T_1,\ldots,T_k)$ is consistent
with the examples, then $G$ is $k$-colorable. Notice that if any term $T_i$ has in it a negated variable
corresponding to a vertex or edge of some graph $G_q$, then $T_i$ is not satisfied by any example
corresponding to graph $G_r$ for $r \ne q$. If term $T_i$ has in it negated variables from more than one
graph $G_q$, then no examples satisfy term $T_i$, and thus the concept is equivalent to the concept
with term $T_i$ replaced by 0. If $T_i$ contains negated variables corresponding to a vertex or edge
of just one graph $G_q$, then we can replace term $T_i$ by 0 and mark graph $G_q$; this new concept is
still consistent with the examples corresponding to all unmarked graph copies. We continue this
procedure until all terms left have no negated variables. We never mark all the graph copies
since we mark at most one graph for each term that is set to 0, and there are more graphs than
terms. So, since each term left has no negated variables we can color any one of the remaining
unmarked graphs using the coloring given in the proof of Theorem 5. ∎


2.5 Conclusion 


We present an algorithm that learns the class $C_k$ of all concepts $f(T_1,\ldots,T_k)$ where $f$ is a {0, 1}-
valued function and $T_1,\ldots,T_k$ are monomials, using a hypothesis class of general DNF. We also
show that learning the class $C_{k,f}$ by $C_{k,f}$ where $f$ is a symmetric function is NP-hard, except
for $f \in \{\wedge, \neg\wedge, T, F\}$ for which learning is easy. We leave as open the problem of classifying
the learnability of $C_{k,f}$ by $C_{k,f}$ for more general functions $f$.


CHAPTER 3


Piecemeal learning of unknown 
environments 


3.1 Introduction 


We address the situation where a robot, to perform a task better, must learn a complete map of 
its environment. The robot’s goal is to learn this map while satisfying the piecemeal constraint 
that learning must be done “a piece at a time.” Why might mobile robot exploration be done 
piecemeal? Robots may have limited power, and after some exploration they may need to 
recharge or refuel. In addition, robots may explore environments that are too risky or costly for 
humans to explore, such as the inside of a volcano (e.g., CMU’s Dante II robot), or a chemical 
waste site, or the surface of Mars. In these cases, the robot’s hardware may be too expensive 
or fragile to stay long in dangerous conditions. Thus, it may be best to organize the learning 
into phases, allowing the robot to return to a start position for refueling and maintenance. 
The “piecemeal constraint” means that each of the robot’s exploration phases must be of 
limited duration. We assume that each exploration phase starts and ends at a fixed start 
position. This special location might be a refueling station or a base camp. Between explo- 
ration phases the robot might perform other unspecified tasks. Piecemeal learning thus enables 
“learning on the job”, since the phases of piecemeal learning can help the robot improve its 
performance on the other tasks it performs. This is the “exploration/exploitation tradeoff”: 


spending some time exploring (learning) and some time exploiting what one has learned. 
The piecemeal constraint can make efficient exploration surprisingly difficult. We first con-
sider piecemeal learning in environments that can be modeled as grid graphs with rectangular
obstacles. For these environments, we give two linear-time algorithms. The first algorithm, 
the “wavefront” algorithm, can be viewed as an optimization of breadth-first search for our 
problem. The second algorithm, the “ray” algorithm, can be viewed as a variation on depth- 
first search. We then extend these results by giving a nearly linear algorithm for piecemeal 
learning more complicated environments that can be modeled by arbitrary undirected graphs. 
For piecemeal learning of these environments, we give some “approximate” breadth-first search 
algorithms. We first give a simple algorithm that runs in $O(E + V^{1.5})$ time. We then improve
this algorithm and give a nearly linear time algorithm: it achieves $O(E + V^{1+o(1)})$ running time.
An interesting open problem is whether arbitrary, undirected graphs can be learned piecemeal 
in linear time. 

We now give a brief summary of the rest of this chapter. Section 3.2 gives some related 
work on environment learning and mobile robot navigation. Section 3.3 formalizes our model. 
Section 3.4 discusses piecemeal learning of arbitrary graphs, and the problems with some initial 
approaches. Section 3.5 gives an approximate solution to the off-line version of this problem. In 
addition, it gives our strategy for solving the problem we are interested in (the on-line version 
of the problem). Section 3.6 introduces the notion of “city-block” graphs, discusses shortest 
paths in such graphs, and gives two linear time algorithms for piecemeal learning these types 
of graphs. Section 3.7 considers piecemeal learning of general graphs, and gives a nearly linear 
algorithm for this problem. Section 3.8 gives an application of our algorithms to the problem 
of finding a treasure in an unknown, potentially infinite graph. Finally, Section 3.9 concludes 


with some open problems. 


3.2 Related work 


Theoretical approaches to environment learning differ in how the robot’s environment is mod-
eled, what types of sensors the robot has, how accurate those sensors are, whether the robot has
access to a teacher, and what the performance measure is. The robot’s environment is often
modeled by a finite automaton, a directed graph, an undirected graph, or some special case of 


the above. Typically, it is assumed that the robot knows what type of environment it is trying 
to learn. The robot may have vision, or may have no long-range sensors whatsoever. Sometimes
the robot is assumed to have accurate sensors, and in other models the robot’s sensors may be 
noisy. Performance measures for the robot’s accuracy vary from requiring the robot to always 
output an exact map of the environment, to requiring that the robot output a good map with 
high probability. Performance in terms of efficiency can be judged by either the total number 
of steps taken by the robot, the number of queries the robot may have to ask of a teacher, 
competitive ratios (e.g., the total number of steps the robot makes divided by the minimum 
number of steps required had the robot known the environment), or some other measure. 

Rivest and Schapire [70] study environments that can be modeled by strongly connected
deterministic finite automata. The robot gets information about the automaton by actively
experimenting in the environment and by observing input-output behavior. Rivest and Schapire 
show that a robot with a teacher can with high probability learn such an environment. They 
use homing sequences to improve Angluin’s algorithm [1] to learn without using a “reset” 
mechanism. Ron and Rubinfeld [71] further extend this result by giving an efficient algorithm 
that with high probability learns finite automata with small cover time, without requiring a 
teacher. Dean et al. [33] study the problem of learning finite automata when the output at
each state has some probability of being incorrect. They give an algorithm for learning finite 
automata, assuming that the robot has access to a distinguishing sequence. Freund et al. [43] 
give algorithms for learning “typical” deterministic finite automata from random walks. 

Deng and Papadimitriou [35] and Betke [16] model the robot’s environment as a directed 
graph, with distinct and recognizable vertices and edges. They give a learning algorithm with
a constant competitive ratio when the graph is Eulerian or when the deficiency of the graph 
is 1. For general graphs, they give a competitive ratio that is exponential in the deficiency of 
the graph. Bender and Slonim [11] look at the more complicated case of directed graphs with 
indistinguishable vertices. They show that a single robot with a constant number of pebbles 
cannot learn such environments without knowing the size of the graph. On the other hand, 
they give a probabilistic algorithm for two cooperating robots to learn such an environment. 
Dudek et al. [38] study the easier problem of learning undirected graphs with indistinguishable 
vertices, and give an algorithm for a robot with one or more markers to learn such an environment.


Deng, Kameda, and Papadimitriou [34] model environments such as “rooms” as polygons 
with polygonal obstacles. They assume the robot has vision, and must learn a map of the
room. They show that if the polygon has an arbitrary number of polygonal obstacles in it,
then it is not possible to achieve a constant competitive ratio. For the simplified case of
a rectilinear room with no obstacles, they show a $2\sqrt{2}$ competitive algorithm for learning the
room. Kleinberg [52] improves this to a 2/2 competitive algorithm. For a rectilinear room 
with at most $k$ obstacles, Deng et al. give an algorithm with $O(k)$ competitive ratio. They also
give constant competitive algorithms for environments that are modeled by general polygons 
with a bounded number of obstacles, but the constant they give is large. 

There has also been much theoretical work in the case where the robot’s goal is to get from 
one point to another in an unknown environment. The robot learns parts of the environment 
as it is navigating, but its primary goal is to reach a particular location. In some cases, the 
robot knows exactly where the goal location is, and in others it is assumed that the robot
will recognize the goal location. 

Baeza-Yates, Culberson and Rawlins [8] study the cow path problem. The robot must search
for an object in an unknown location on 2 or more rays (the endpoints of the rays are at some 
fixed start position). They give an optimal deterministic strategy for this problem. For the 
case of 2 rays, they use a doubling strategy and get a competitive ratio of 9; they extend this 
technique for $m$ rays and get a competitive ratio of $1 + 2m^m/(m-1)^{m-1}$. Kao, Reif and
Tate [49] give a randomized algorithm for this problem that has better expected performance 
than any deterministic algorithm. Kao, Ma, Sipser and Yin [48] give an optimal deterministic 
search strategy for the case of multiple robots. 
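The doubling strategy for two rays can be simulated directly. The sketch below is an illustration, not taken from the cited paper: it assumes the target lies at distance at least 1 from the start, that the searcher alternates rays and doubles its reach on each excursion, and the function name `doubling_search_cost` is hypothetical.

```python
def doubling_search_cost(target_side, target_dist):
    """Distance walked by the doubling strategy before reaching the target.

    target_side is 0 or 1 (which ray), target_dist >= 1 is the target's
    distance from the start. Excursions have reach 1, 2, 4, ... and
    alternate between ray 0 and ray 1.
    """
    travelled = 0
    reach = 1
    side = 0
    while True:
        if side == target_side and reach >= target_dist:
            return travelled + target_dist  # found during this excursion
        travelled += 2 * reach              # walk out and back to the start
        reach *= 2
        side = 1 - side
```

Dividing the returned cost by `target_dist` gives the ratio against an optimal searcher who knows where the target is; under the stated assumptions this ratio stays below 9 and approaches 9 for targets placed just beyond a turning point.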

Papadimitriou and Yannakakis [62] consider the problem of a robot with vision moving around
in a plane filled with obstacles. The robot does not know its environment, but knows its exact 
absolute location at all times, as well as its start position and its goal position. The robot’s 
goal is to travel from the start position to the goal position. Papadimitriou and Yannakakis show
that for the case of non-touching axis parallel rectangular obstacles, the competitive ratio is
$\Omega(\sqrt{n})$, where $n$ is the length of the shortest path between the start and goal locations. For
the case of square obstacles, they give a $\sqrt{26}/3 \approx 1.7$ competitive algorithm, and show that any
strategy must have competitive ratio greater than $3/2$.


Blum, Raghavan, and Schieber [22] also study the problem of point to point navigation in 
an unknown two-dimensional geometric environment with convex obstacles. For the case of axis
parallel rectangular obstacles, they give an algorithm with competitive ratio $O(\sqrt{n})$, matching
the lower bound of Papadimitriou and Yannakakis. They also introduce and give an algorithm
for the room problem, where the goal of the robot is to go from a point on a wall of the room
to a specified point in the center of the room. The room contains axis parallel obstacles, but 
the obstacles do not touch the sides of the wall. Bar-Eli, Berman, Fiat, and Yan [10] show that 
any algorithm for this problem has competitive ratio $\Omega(\log n)$, and give an algorithm attaining
this bound. 

Blum and Chalasani [21] consider the point to point problem in an unknown environment 
when the robot makes repeated trips between two points. The goal of the robot is to find better 
paths in each trip. In environments with axis parallel obstacles, they give an algorithm with 
the property that at the $i$-th trip, the robot’s path is $O(\sqrt{n/i})$ times the shortest path length.

Klein [51] considers the problem of a polygon with distinguished start and goal vertices. 
The robot’s goal is to walk inside the polygon from the start location to the goal location. The 
goal location is recognized as soon as the robot sees it. For a special type of polygon known 
as a street, Klein gives an algorithm with a $1 + \frac{3}{2}\pi \approx 5.71$ competitive ratio. Kleinberg [52]
improves this by giving an algorithm with competitive ratio $\sqrt{4 + \sqrt{8}} \approx 2.61$. For rectilinear
streets, the algorithm achieves a competitive ratio of $\sqrt{2}$.

There are many other related papers in the literature, particularly in the area of robotics 
(e.g., [57]) and maze searching (e.g., [25, 24]). Rao, Kareti, Shi, and Iyengar [68] give a survey 


of work on robot navigation in unknown terrains. 


3.3 Formal model 


We model the robot’s environment as a finite connected undirected graph $G = (V, E)$ with dis-
tinguished start vertex $s$. Vertices represent accessible locations. Edges represent accessibility:
if $\{x,y\} \in E$ then the robot can move from $x$ to $y$, or back, in a single step.

We assume that the robot can always recognize a previously visited vertex; it never confuses 
distinct locations. At any vertex the robot can sense only the edges incident to it; it has no 
vision or long-range sensors. The robot can distinguish between incident edges at any vertex. 


Each edge has a label that distinguishes it from any other edge. Without loss of generality, 
we can assume that the edges are ordered. At a vertex, the robot knows which edges it has
traversed already. The robot only incurs a cost for traversing edges; thinking (computation) is 
free. We also assume a uniform cost for an edge traversal. We consider the running time of a 
piecemeal learning algorithm to be the number of edge traversals made by the robot. 

The robot is given an upper bound B on the number of steps it can make (edges it can 
traverse) in one exploration phase. In order to assure that the robot can reach any vertex in 
the graph, do some exploration, and then get back to the start vertex, we assume B allows for 
at least one round trip between s and any other single vertex in G, and also allows for some 
number of exploration steps. More precisely, we assume $B = (2 + \alpha)r$, where $\alpha > 0$ is some
constant, and r is the radius of the graph (the maximum of all shortest-path distances between 
s and any vertex in G). 
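To make the quantities concrete, the radius $r$ and the phase budget $B = (2+\alpha)r$ can be computed offline for a known graph. In the model the robot is simply given $B$ and $r$, so the sketch below is only an illustration for someone setting up an experiment; the adjacency-dictionary representation and the function names are assumptions.

```python
from collections import deque

def radius_from(adj, s):
    """Largest shortest-path distance from s to any vertex (graph assumed connected)."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

def phase_budget(adj, s, alpha):
    """Step bound B = (2 + alpha) * r for one exploration phase."""
    return (2 + alpha) * radius_from(adj, s)
```

A budget of this size always covers a round trip of $2r$ steps to the farthest vertex, leaving $\alpha r$ steps for actual exploration.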

Initially all the robot knows is its starting vertex s, the bound B, and the radius r of the 
graph. The robot’s goal is to explore the entire graph: to visit every vertex and traverse every


edge, minimizing the total number of edges traversed. 


3.4 Initial approaches to piecemeal learning 


A simple approach to piecemeal learning of arbitrary undirected graphs is to use an ordinary 
search algorithm—breadth-first search (BFS) or depth-first search (DFS)—and just interrupt 
the search as needed to return to visit s. (Detailed descriptions of BFS and DFS can be found 
in algorithms textbooks [32].) Once the robot has returned to s, it goes back to the vertex at 
which search was interrupted and resumes exploration. We now illustrate the problems each of 


these approaches has for efficient piecemeal learning. 


Depth-first search 


In depth-first search, edges are explored out of the most recently discovered vertex v that still has 
unexplored edges leaving it. When all of v’s edges have been explored, the search “backtracks” 
to explore edges leaving the vertex from which v was discovered. This process continues until all 
edges are explored. This search strategy, without interruptions due to the piecemeal constraint, 


is efficient since at most $2|E|$ edges are traversed. Interruptions, or exploration in phases of
limited duration, complicate matters. For example, suppose in the first phase of exploration, at
step B/2 of a phase the robot reaches a vertex v as illustrated in Figure 3.1. Moreover, suppose 
that the only path the robot knows from s to v has length B/2. At this point, the robot must 
stop exploration and go back to the start location s. In the second phase, in order for the robot 
to resume a depth-first search, it should go back to v, the most recently discovered vertex. 
However, since the robot only knows a path of length B/2 to v, it cannot proceed with exploration


from that point. 




Figure 3.1: The robot reaches vertex v after B/2 steps in a depth-first search. Then it must 
interrupt its search and return to s. It cannot resume exploration at v to get to vertex w, 
because the known return path is longer than B/2, the remaining number of steps allowed in 
this exploration phase. DFS fails. 


Since DFS with interruptions fails to reach all the vertices in the graph, another approach 
to solve the piecemeal learning problem would be to try a bounded depth-first search strategy. 
In bounded DFS, edges are explored out of the most recently discovered vertex v that has
depth less than a given bound β. However, a straightforward bounded DFS strategy also does


not translate into an efficient piecemeal learning algorithm for arbitrary undirected graphs. 


Breadth-first search 


Unlike depth-first search, breadth-first search with interruptions does guarantee that all vertices 
in the graph are ultimately explored. Whereas a DFS strategy cannot resume exploration at 
vertices to which it only knows a long path, a BFS strategy can always resume exploration. 
This is because BFS ensures that the robot always knows a shortest path from s to any explored 
vertex. However, since a BFS strategy explores all the vertices at the same distance from s 
before exploring any vertices that are further away from s, the resulting algorithm may not be 
efficient. Note that in the usual BFS model, the algorithm uses a queue to keep track of which 


vertex it will search from next. Thus, searching requires extracting a vertex from this queue. 
In our model, however, since the robot can only search from its current location, extracting a
vertex from this queue results in a relocation from the robot’s current location to the location 
of the new vertex. Unlike the standard BFS model, our model does not allow the robot to 
“teleport” from one vertex to another; instead, we consider a teleport-free exploration model, 
where the robot must physically move from one vertex to the next. 

In BFS, the robot may not move further away from the source than the unvisited vertex 
nearest to the source. At any given time in the algorithm, let Δ denote the shortest-path
distance from s to the vertex the robot is visiting, and let δ denote the shortest-path distance
from s to the vertex nearest to s that is as yet unvisited. With traditional breadth-first search
we have Δ ≤ δ at all times. With teleport-free exploration, it is generally impossible to maintain
Δ ≤ δ without a great loss of efficiency:


Lemma 4 A robot which maintains Δ ≤ δ (such as a traditional BFS) may traverse Ω(|E|²)
edges.

Proof: Consider the graph in Figure 3.2, where the vertices are {−n, −n + 1, ..., −1, s =
0, 1, 2, ..., n − 1, n}, and edges connect consecutive integers. To achieve Δ ≤ δ, a teleport-free
BFS algorithm would run in quadratic time, traveling back and forth from 1 to −1 to −2 to 2
to 3 .... □


-4 -3 -2 -1 0 1 2 3 4 


Figure 3.2: A simple graph for which the cost of BFS is quadratic in the number of edges. 
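
The back-and-forth behaviour in the proof of Lemma 4 can be checked with a short simulation. The sketch below is only an illustration and not part of the thesis: the names `visit_order` and `teleport_free_cost` are hypothetical, and the visiting order 1, −1, −2, 2, 3, ... is the one described in the proof.

```python
def visit_order(n):
    """Order 1, -1, -2, 2, 3, -3, ... in which the robot first visits vertices."""
    order = []
    for k in range(1, n + 1):
        # Alternate the side visited first so consecutive levels share a side.
        order.extend((k, -k) if k % 2 == 1 else (-k, k))
    return order


def teleport_free_cost(n):
    """Edges walked on the path graph -n..n when every move is physical."""
    position, cost = 0, 0
    for target in visit_order(n):
        cost += abs(target - position)  # walk along the line, no teleporting
        position = target
    return cost


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        # columns: n, number of edges 2n, edges walked
        print(n, 2 * n, teleport_free_cost(n))
```

Under these assumptions the cost works out to n² + 2n, that is |E|²/4 + |E| for a path with |E| = 2n edges, which grows quadratically in the number of edges.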


3.5 Our approaches to piecemeal learning 


In this section, we discuss our approach to piecemeal learning of general graphs. First we 


define the off-line version of this problem, and give an approximate solution for it, and then we 
give a general method for converting certain types of search algorithms into piecemeal learning


algorithms. 


3.5.1 Off-line piecemeal learning 


We now develop a strategy for the off-line piecemeal learning problem which we can adapt to 
get a strategy for the on-line piecemeal learning problem. 

In the off-line piecemeal learning problem, the robot is given a finite connected undirected 
graph G = (V, E), a start location s ∈ V, and a bound B on the number of edges traversed in
any exploration phase. The robot’s goal is to plan an optimal search of the graph that visits 
every vertex and traverses every edge, and also satisfies the piecemeal constraint (i.e., each 
exploration phase traverses at most B edges and starts and ends at the start location). Note 
that since the graph is given, the problem does not actually have a learning or exploration 
component. However, for simplicity we continue using “learning” and “exploration.” 

The off-line piecemeal learning problem is similar to the well-known Chinese Postman Prob- 
lem [39], but where the postman must return to the post-office every so often. (We could call 
the off-line problem the Weak Postman Problem, for postmen who cannot carry much mail.) 
The same problem arises when many postmen must cover the same city with their routes. 

The Chinese Postman Problem can be solved by a polynomial time algorithm if the graph 
is either undirected or directed [39]. The Chinese Postman problem for a mixed graph that has 
undirected and directed edges was shown to be NP-complete by Papadimitriou [61]. We do not 
know an optimal off-line algorithm for the Weak Postman Problem; this may be an NP-hard 
problem. 

We now give an approximation algorithm for the off-line piecemeal learning problem using 


a simple “interrupted-DFS” approach. 


Theorem 6 There exists an approximate solution to the off-line piecemeal learning problem 


for an arbitrary undirected graph G = (V,E) which traverses O(|E|) edges. 


Proof: Assume that the radius of the graph is r and that the number of edges the robot is
allowed to traverse in each phase of exploration is B = (2 + α)r, for some constant α such that
αr is a positive integer. Before the robot starts traversing any edges in the graph, it looks at
the graph to be explored, and computes a depth-first search tree of the graph. A depth-first
traversal of this depth-first search tree defines a path of length 2|E| which starts and ends at s
and which goes through every vertex and edge in the graph. The robot breaks this path into
segments of length αr. The robot also computes (off-line) a shortest path from s to the start
of each segment.

The robot then starts the piecemeal learning of the graph. Each phase of the exploration
consists of taking a shortest path from s to the start of a segment, traversing the edges in the
segment, and taking a shortest path back to the start vertex. For each segment, the robot
traverses at most 2r edges to get to and from the segment, and αr edges to explore the segment
itself. Thus, since the total number of edge traversals for each segment is at most (2 + α)r = B,
the piecemeal constraint is satisfied. Since there are ⌈2|E|/(αr)⌉ segments, there are ⌈2|E|/(αr)⌉ − 1
interruptions, and the number of edge traversals due to interruptions is at most:

\[
\left(\left\lceil \frac{2|E|}{\alpha r} \right\rceil - 1\right) 2r \;\le\; \frac{2|E|}{\alpha r}\, 2r \;=\; \frac{4|E|}{\alpha}
\]

Thus the total number of edge traversals is at most (4/α + 2)|E| = O(|E|). □
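
The construction in the proof of Theorem 6 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names are hypothetical, the graph is a simple connected graph given as an adjacency dictionary, r is taken to be the largest shortest-path distance from s, and `segment_length` plays the role of αr. The walk produced by `dfs_walk` crosses every edge exactly twice, so it has length 2|E|.

```python
from collections import deque


def bfs_distances(adj, s):
    """Shortest-path distance from s to every vertex."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def dfs_walk(adj, s):
    """Closed walk from s that traverses every edge exactly twice."""
    walk, visited, used = [s], {s}, set()

    def visit(u):
        for w in adj[u]:
            edge = frozenset((u, w))
            if edge in used:
                continue
            used.add(edge)
            walk.append(w)        # cross the edge
            if w not in visited:
                visited.add(w)
                visit(w)          # explore beyond w, ending back at w
            walk.append(u)        # return along the same edge

    visit(s)
    return walk


def piecemeal_plan(adj, s, segment_length):
    """Return (total edge traversals, longest phase, radius r)."""
    dist = bfs_distances(adj, s)
    r = max(dist.values())
    walk = dfs_walk(adj, s)
    steps = len(walk) - 1         # equals 2|E|
    total = longest = 0
    for start in range(0, steps, segment_length):
        end = min(start + segment_length, steps)
        # shortest path out, the segment itself, shortest path home
        phase = dist[walk[start]] + (end - start) + dist[walk[end]]
        total += phase
        longest = max(longest, phase)
    return total, longest, r


def grid_graph(m, n):
    """Obstacle-free m x n grid as an adjacency dictionary (example input)."""
    adj = {(i, j): [] for i in range(m) for j in range(n)}
    for (i, j) in adj:
        for nb in ((i + 1, j), (i, j + 1)):
            if nb in adj:
                adj[(i, j)].append(nb)
                adj[nb].append((i, j))
    return adj


if __name__ == "__main__":
    total, longest, r = piecemeal_plan(grid_graph(4, 4), (0, 0), 6)
    print("radius:", r, "longest phase:", longest, "total traversals:", total)
```

Every phase reported by `piecemeal_plan` uses at most 2r + `segment_length` edges, and the total stays within the (4/α + 2)|E| bound of the theorem when α is `segment_length` divided by r.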


3.5.2 On-line piecemeal learning 


We now show how we can change the strategy outlined above to obtain an efficient on-line 
piecemeal learning algorithm. 

We call an on-line search optimally interruptible if it always knows a shortest path back 
to s that can be composed from the edges that have been explored. We refer to a search as 
efficiently interruptible if it always knows a path back to s via explored edges of length at most 


the radius of the graph. 


Theorem 7 An efficiently interruptible algorithm for exploring an unknown graph G = (V, E) 
with n vertices and m edges that takes time T(n,m) can be transformed into a piecemeal learning 


algorithm that takes time O(T(n,m)). 


Proof: The proof of this theorem is similar to the proof of Theorem 6. However, there are a 
few differences. Instead of using an ordinary search algorithm (like DFS) and interrupting as
needed to return to s, we use an efficiently interruptible search algorithm. Moreover, the search
is on-line and is being interrupted during exploration. Finally, the cost of the search is not 2|E|
as in DFS, but at most T(n,m).

Assume that the radius of the graph is r and that the number of edges the robot is allowed
to traverse in each phase of exploration is B = (2 + α)r, for some constant α such that αr is a
positive integer. In each exploration phase, the robot will execute αr steps of the original search
algorithm. At the beginning of each phase the robot goes to the appropriate vertex to resume
exploration. Then the robot traverses αr edges as determined by the original search algorithm,
and finally the robot returns to s. Since the search algorithm is efficiently interruptible, the
robot knows a path of distance at most r from s to any vertex in the graph. Thus the robot
traverses at most 2r + αr = B edges during any exploration phase.

Since there are ⌈T(n,m)/(αr)⌉ segments, there are ⌈T(n,m)/(αr)⌉ − 1 interruptions, and the number of
edge traversals due to interruptions is at most:

\[
\left(\left\lceil \frac{T(n,m)}{\alpha r} \right\rceil - 1\right) 2r \;\le\; \frac{2\,T(n,m)}{\alpha}
\]

Thus, the total number of edge traversals is at most T(n,m) + 2T(n,m)/α = T(n,m)(1 + 2/α) =
O(T(n,m)). □
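
As a worked instance of this bound (the numerical values of α are illustrative and do not come from the text), the overhead factor 1 + 2/α can be evaluated for a few phase budgets B = (2 + α)r. A larger α means that more of each phase is spent exploring, so the overhead shrinks.

```latex
\[
T(n,m)\left(1 + \frac{2}{\alpha}\right) =
\begin{cases}
5\,T(n,m) & \text{if } \alpha = \tfrac{1}{2} \quad (B = 2.5\,r),\\[2pt]
3\,T(n,m) & \text{if } \alpha = 1 \quad (B = 3r),\\[2pt]
2\,T(n,m) & \text{if } \alpha = 2 \quad (B = 4r).
\end{cases}
\]
```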


For arbitrary undirected planar graphs, we can show that any optimally interruptible search
algorithm requires Ω(|E|²) edge traversals in the worst case. For example, exploring the graph
in Figure 3.2 (known initially only to be an arbitrary undirected planar graph) would result in
Ω(|E|²) edge traversals if the search is required to be optimally interruptible.

Because it seems difficult to handle arbitrary undirected graphs efficiently, we first focus
our attention on a special class of undirected planar graphs. These graphs, known as city-block
graphs, are defined in Section 3.6.1. For these graphs we present two efficient O(|E|)
optimally interruptible search algorithms. Since an optimally interruptible search algorithm is
also an efficiently interruptible search algorithm, these two algorithms give efficient piecemeal
learning algorithms for city-block graphs. The wavefront algorithm is a modification of breadth-first
search that is optimized for city-block graphs. The ray algorithm is a variation on depth-first
search. For piecemeal learning of arbitrary undirected graphs, since optimally interruptible
search algorithms are not efficient, we look at efficiently interruptible search algorithms. In
particular, our algorithms are approximate breadth-first search algorithms.


3.6 Linear time algorithms for city-block graphs 


This section first defines and motivates the class of city-block graphs, and then develops some 
useful properties of such graphs that will be used in Subsections 3.6.2 (which gives the wavefront 
algorithm for piecemeal learning of a city-block graph) and 3.6.3 (which gives the ray algorithm). 

Both the wavefront algorithm and the ray algorithm are optimally interruptible, and thus 
maintain at all times knowledge of a shortest path back to s. Since BFS is optimally inter- 
ruptible, we study BFS in some detail to understand the characteristics of shortest paths in 
city-block graphs. Our algorithms depend on the special properties that shortest paths have 
in city-block graphs. We also study BFS because our wavefront algorithm is a modification of 


BFS. 


3.6.1 City-block graphs 


We model environments such as cities or office buildings in which efficient on-line robot nav- 
igation may be needed. We focus on grid graphs containing some non-touching axis-parallel 
rectangular “obstacles”. We call these graphs city-block graphs. They are rectangular planar 
graphs in which all edges are either vertical (north-south) or horizontal (east-west), and in which 
all faces (city blocks) are axis-parallel rectangles whose opposing sides have the same number 
of edges. A 1 × 1 face might correspond to a standard city-block; larger faces might correspond
to obstacles (parks or shopping malls). Figure 3.3 gives an example. City-block graphs are also
studied by Papadimitriou and Yannakakis [62], Blum, Raghavan, and Schieber [22], and Bar-Eli,
Berman, Fiat and Yan [10].

An m × n city-block graph with no obstacles has exactly mn vertices (at points (i, j) for
1 ≤ i ≤ m, 1 ≤ j ≤ n) and 2mn − (m + n) edges (between points at distance 1 from each
other). Obstacles, if present, decrease the number of accessible locations (vertices) and edges
in the city-block graph. In city-block graphs the vertices and edges are deleted such that all
remaining faces are rectangles.
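
The vertex and edge counts for the obstacle-free case can be confirmed by building the grid explicitly. The following is a small sketch; the function name `obstacle_free_grid` is hypothetical and chosen only for this illustration.

```python
def obstacle_free_grid(m, n):
    """Vertices (i, j), 1 <= i <= m, 1 <= j <= n, with unit-distance edges."""
    vertices = {(i, j) for i in range(1, m + 1) for j in range(1, n + 1)}
    edges = set()
    for (i, j) in vertices:
        # Each edge is recorded once, from its west or south endpoint.
        for neighbour in ((i + 1, j), (i, j + 1)):
            if neighbour in vertices:
                edges.add(((i, j), neighbour))
    return vertices, edges


if __name__ == "__main__":
    for m, n in ((1, 1), (3, 4), (5, 5)):
        vertices, edges = obstacle_free_grid(m, n)
        # columns: m, n, |V|, |E|, closed form 2mn - (m + n)
        print(m, n, len(vertices), len(edges), 2 * m * n - (m + n))
```

The count follows because there are (m − 1)n edges in one direction and m(n − 1) in the other.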


We assume that the directions of incident edges are apparent to the robot. 




Figure 3.3: A city-block graph with distinguished start vertex s. 




Let δ(v, v′) denote the length of the shortest path between v and v′, and let d[v] denote
δ(v, s), the length of the shortest path from v back to s.


Monotone paths and the four-way decomposition 


A city-block graph can be usefully divided into four regions (north, south, east, and west) by four 
monotone paths: an east-north path, an east-south path, a west-north path, and a west-south 
path. The east-north path starts from s, proceeds east until it hits an obstacle, then proceeds 
north until it hits an obstacle, then turns and proceeds east again, and so on. The other paths 
are similar (see Figure 3.4). Note that all monotone paths are shortest paths. Furthermore, 
note that s is included in all four regions, and that each of the four monotone paths (east-north, 
east-south, west-north, west-south) is part of all regions to which it is adjacent. 

In Lemma 5 we show that for any vertex, there is a shortest path to s through only one 
region. Without loss of generality, we therefore only consider optimally interruptible search 
algorithms that divide the graph into these four regions, and search these regions separately. 


We only discuss what happens in the northern region; the other regions are handled similarly. 

Figure 3.4: The four monotone paths and the four regions.


Lemma 5 There exists a shortest path from s to any point in a region that only goes through 


that region. 


Proof: Consider a point v in some region A. Let p be any shortest path from s to the point
v. If p is not entirely contained in region A, we can construct another path p’ that is entirely
contained in region A. We note that the vertices and edges which make up the monotone paths
surrounding a region A are considered to be part of that region.

Since path p starts and ends in region A but is not entirely contained in region A, there
must be a point u that is on p and also on one of the monotone paths bordering A. Note that
u may be the same as v. Without loss of generality, let u be the last such point, so that the
portion of the path from u to v is contained entirely within region A. Then the path p’ will
consist of the shortest path from s to u along the monotone path that u is on, followed by the
portion of p from u to v. This path p’ is a shortest path from s to v because p was a shortest
path and p’ can be no longer than p. □


Canonical shortest paths of city-block graphs 


We now make a fundamental observation on the nature of shortest paths from a vertex v back 
to s. In this section, we consider shortest paths in the northern region; properties of shortest 


paths in other regions are similar.

Lemma 6 For any vertex v in the northern region, there is a canonical shortest path from v to
the start vertex s which goes south whenever possible. The canonical shortest path goes east or 
west only when it is prevented from going south by an obstacle or by the monotone path defining 


the northern region. 


Proof: We call the length d[v] of the shortest path from v to s the depth of vertex v. We show 
this lemma by induction on the depth of a vertex. 

For the base case, it is easy to verify that any vertex v such that d[v] = 1 has a canonical 
shortest path that goes south whenever possible. 

For the inductive hypothesis, we assume that the lemma is true for all vertices that have
depth t − 1, and we want to show it is true for all vertices that have depth t. Consider a vertex p
at depth t. If there is an obstacle obstructing the vertex that is south of point p or if p is on a
horizontal segment of the monotone path defining the northern region, then it is impossible for
the canonical shortest path to go south, and the claim holds. Thus, assume the point south of
p is not obstructed by an obstacle or by the monotone path defining the northern region. Then
we have the following cases:


Case 1: Vertex p_s directly south of p has depth t − 1. In this case, there is clearly a
canonical shortest path from p to s which goes south from p to p_s and then follows the
canonical shortest path of p_s, which we know exists by the inductive assumption.


Case 2: Vertex p_s directly south of p has depth not equal to t − 1. Then one of the
remaining adjacent vertices must have depth t − 1 (otherwise it is impossible for p to have
depth t). Furthermore, none of these vertices has depth less than t − 1, for otherwise
vertex p would have depth less than t.


Note that the point directly north of p cannot have depth t − 1. If it did, then by the
inductive hypothesis, it has a canonical shortest path which goes south. But then p has
depth t − 2, which is a contradiction.


Thus, either the point west of p or the point east of p has depth t − 1. Without loss of
generality, assume that the point p_w west of p has depth t − 1. We consider two subcases.
In case (a), there is a path of length 2 from p_w to p_s that goes south one step from p_w
and then goes east to p_s. In case (b), there is no such path.

Case (a): If there is such a path, the vertex directly south of p_w exists, and by the
inductive hypothesis has depth t − 2 (since there is a canonical shortest path from
p_w to s of length t − 1, the vertex directly to the south of p_w has depth t − 2). Then
p_s, which is directly east of this point, has depth at most t − 1 and thus there is a
canonical path from p to s which goes south whenever possible.


Case (b): Note that the only way there does not exist a path of length 2 from p_w to
p_s (other than the obvious one through p) is if p is a vertex on the northeast corner
of an obstacle which is bigger than 1 × 1. Suppose the obstacle is k_1 × k_2, where k_1 is
the length of the north (and south) side of the obstacle, and k_2 is the length of the
east (and west) side of the obstacle. We know by the inductive hypothesis that the
canonical shortest path from p_w goes either east or west along the north side of this
obstacle, and since the vertex p has depth t we know that the canonical shortest path
goes west. After having reached the corner, the canonical shortest path from p_w to s
proceeds south. Thus, the vertex which is on the southwest corner of this obstacle
has depth l = t − 1 − (k_1 − 1) − k_2. If we go from this vertex to p_s along the south side
of the obstacle and then along the east side of the obstacle, then the depth of point
p_s is at most l + k_1 + (k_2 − 1) = t − 1. Thus, in this case there is also a canonical
path from p to s which goes south whenever possible.


Lemma 7 Consider adjacent vertices v and w in a city-block graph where v is north of w. In 


the northern region, without loss of generality, d[v] = d[w] + 1.


Proof: The proof follows immediately from Lemma 6. □


Lemma 8 Consider adjacent vertices v and w in a city-block graph where v is west of w. In 


the northern region, without loss of generality, d[v] = d[w] ± 1.


Proof: We prove the lemma by induction on the y-coordinate of the vertices in the northern
region. If v and w have the same y-coordinate as s, then we know that d[v] = d[w] + 1 if s is
east of v and d[v] = d[w] − 1 if s is west of w. Assume that the claim is true for vertices v
and w with y-coordinate k. In the following we show that it is also true for vertices v and w
with y-coordinate k + 1. We distinguish the case that there is no obstacle directly south of v
and w from the case that there is an obstacle directly south of v or w.


Case 1: If there is no obstacle directly south of v and w, or there is a 1 × 1 obstacle with v
and w on the north side, the lemma follows by Lemma 7 and the induction assumption.


Case 2: If there is an obstacle directly south of v or w, then we assume without loss of 
generality that both v and w are on the boundary of the north side of the obstacle. (Note 


that v or w may, however, be at a corner of the obstacle.) 


If the lemma does not hold it means that d[v] = d[w] for two adjacent vertices v and w
(because, in any graph, the d values for adjacent vertices can differ by at most one). This
would also mean that all shortest paths from v to s must go through vertex v_w at the
north-west corner of the obstacle and all shortest paths from w to s must go through
vertex v_e at the north-east corner of the obstacle (v_w may be the same as v, and v_e may
be the same as w). However, we next show that there is a grid point m on the boundary
of the north side of the obstacle that has shortest paths through both v_w and v_e. The
claim of Lemma 8 follows directly.


The distance x between m and v_w can be obtained by solving the following equation:
x + d[v_w] = (k − x) + d[v_e], where k is the length of the north side of the obstacle. The
distance x is (k + d[v_e] − d[v_w])/2. Using the inductive hypothesis and Lemma 6, we know
that if k is even then |d[v_e] − d[v_w]| is even, and if k is odd then |d[v_e] − d[v_w]| is odd.
Thus the distance x is integral, and m exists in the graph.
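
The depth properties used in this proof can be checked on a small example. The sketch below makes several assumptions that are not in the text: obstacles are given as hypothetical (x0, y0, x1, y1) rectangles whose boundaries remain walkable, the function names are invented for the illustration, and the 7 × 7 instance with s on the south boundary is chosen only because its depths are easy to verify by hand.

```python
from collections import deque


def city_block_graph(width, height, obstacles):
    """Grid points 0..width-1 x 0..height-1 minus obstacle interiors.

    Each obstacle is (x0, y0, x1, y1); its boundary stays walkable.
    """
    def inside(px, py):
        return any(x0 < px < x1 and y0 < py < y1 for x0, y0, x1, y1 in obstacles)

    adj = {(x, y): [] for x in range(width) for y in range(height)
           if not inside(x, y)}
    for (x, y) in adj:
        for nx, ny in ((x + 1, y), (x, y + 1)):
            # Drop edges whose midpoint lies strictly inside an obstacle.
            if (nx, ny) in adj and not inside((x + nx) / 2, (y + ny) / 2):
                adj[(x, y)].append((nx, ny))
                adj[(nx, ny)].append((x, y))
    return adj


def depths(adj, s):
    """d[v]: shortest-path distance from every vertex v back to s."""
    d = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in d:
                d[w] = d[u] + 1
                queue.append(w)
    return d


# Hypothetical example: 7 x 7 grid, s = (3, 0), one obstacle whose north
# side has length k = 3.
adj = city_block_graph(7, 7, [(2, 2, 5, 4)])
d = depths(adj, (3, 0))

# Adjacent vertices never share a depth.
assert all(abs(d[u] - d[w]) == 1 for u in adj for w in adj[u])

k = 3
v_w, v_e = (2, 4), (5, 4)              # north-west and north-east corners
x = (k + d[v_e] - d[v_w]) // 2         # distance from v_w to the point m
m = (v_w[0] + x, v_w[1])
assert x + d[v_w] == (k - x) + d[v_e] == d[m]
```

In this instance d[v_w] = 5 and d[v_e] = 6, so x = 2 and the point m = (4, 4) is reached equally quickly around either side of the obstacle.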


3.6.2 The wavefront algorithm 


The wavefront algorithm is based on BFS, but overcomes the inefficiency BFS has due to 


relocation cost. In this section, we first develop some preliminary concepts and results based 




on an analysis of breadth-first search in city-block graphs. We then present the wavefront 


algorithm, prove its correctness, and show that it runs in linear time. 


Properties of BFS in city-block graphs 


In city-block graphs, BFS can be viewed as exploring the graph in waves that expand outward 
from the start vertex s, much as waves expand from a pebble thrown into a pond (see Figure 3.5).

Figure 3.5: Environment explored by breadth-first search, showing only “wavefronts” at odd
distance to s.


A wavefront w can then be defined as an ordered list of explored vertices
(v_1, v_2, ..., v_m), m ≥ 1, such that d[v_i] = d[v_j] for all i, j, and such that δ(v_i, v_{i+1}) ≤ 2 for
all i. (As we shall prove in Lemma 9, the distance between adjacent points in a wavefront is
always exactly equal to 2.) We call d[w] = d[v_1] the distance of the wavefront.

There is a natural “successor” relationship between BFS wavefronts, as a wavefront at 
distance ¢ generates a successor at distance t+ 1. We informally consider a wave to be a 
sequence of successive wavefronts. Because of obstacles, however, a wave may split (if it hits 
an obstacle) or merge (with another wave, on the far side of an obstacle). Two wavefronts are 
sibling wavefronts if they each have exactly one endpoint on the same obstacle and if the waves 
to which they belong merge on the far side of that obstacle. The point on an obstacle where the 


waves first meet is called the meeting point m of the obstacle. In the northern region, meeting 
points are always on the north side of obstacles, and each obstacle has exactly one meeting


point on its northern side. See Figure 3.6. 


Figure 3.6: Splitting and merging of wavefronts along a corner of an obstacle. Illustration
of meeting point and sibling wavefronts: w_1 and w_2 are sibling wavefronts which belong to
different “waves.” The waves merge at the meeting point.


Lemma 9 A wavefront can only consist of diagonal segments. 


Proof: By definition a wavefront is a sequence of vertices at the same distance to s for which the 
distance between adjacent vertices is at most 2. It follows from Lemmas 7 and 8 that neighboring
points in the grid cannot be in the same wavefront. Therefore, the distance between adjacent
vertices is exactly 2. Thus, the wavefront can only consist of diagonal segments. □


We call the points that connect diagonal segments (of different orientation) of a wavefront 
peaks or valleys. In the northern region, a peak is a vertex on the wavefront that has a larger 
y-coordinate than the y-coordinates of its adjacent vertices in the wavefront, and a valley is a 
vertex on the wavefront that has a smaller y-coordinate than the y-coordinates of its adjacent 
vertices (see Figure 3.7). 

The initial wavefront is just a list containing the start point s. Until a successor of the initial
wavefront hits an obstacle, the successor wavefronts in the northern region consist of two diagonal
segments connected by a peak. This peak is at the same x-coordinate for these successive
wavefronts. Therefore, we say that the shape of the wavefronts does not change. In the northern 
region a wavefront can only have descendants that have a different shape if a descendant curls 
around the northern corners of an obstacle, or if it merges with another wavefront, or if it splits 


into other wavefronts. These descendants may then have more complicated shapes. 
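
The shape of these early wavefronts can be written down directly. The sketch below assumes an obstacle-free grid that is large enough for the wave not to have reached a boundary, so that distances are simply Manhattan distances; the names `northern_wavefront` and `peak` are hypothetical.

```python
def northern_wavefront(s, t):
    """Vertices at distance t from s with y >= s_y, ordered west to east.

    Assumes an obstacle-free grid in which the wave has not yet reached
    a boundary.
    """
    sx, sy = s
    return [(sx + dx, sy + t - abs(dx)) for dx in range(-t, t + 1)]


def peak(wavefront):
    """The vertex of the wavefront with the largest y-coordinate."""
    return max(wavefront, key=lambda v: v[1])


if __name__ == "__main__":
    s = (3, 1)
    for t in range(1, 4):
        front = northern_wavefront(s, t)
        print(t, front, "peak:", peak(front))
```

Consecutive vertices in each list differ by one step in x and one step in y, so the wavefront consists of two diagonal segments, and the peak keeps the x-coordinate of s from one wavefront to the next.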

Figure 3.7: Shapes of wavefronts. Illustration of peaks and valleys, and front and back of an
obstacle. The meeting point is the lowest point in the valley. 


A wavefront w splits whenever it hits an obstacle. That is, if a vertex v_i in the wavefront
is on the boundary of an obstacle, w splits into wavefronts w_1 = (v_1, v_2, ..., v_i) and w_2 =
(v_i, v_{i+1}, ..., v_m). Wavefront w_1 propagates around the obstacle in one direction, and wavefront
w_2 propagates around in the other direction. Eventually, some descendant wavefront of w_1 and
some descendant wavefront of w_2 will have a common point on the boundary of the obstacle:
the meeting point. The position of the meeting point is determined by the shape of the wave
approaching the obstacle. (In the proof of Lemma 8, vertex m is a meeting point and we showed
how to calculate its position once the length k of the north side of the obstacle and the shortest
path distances of the vertices v_e and v_w at the north-east and north-west corners of the obstacle
are known: the distance from v_w to the meeting point m is (k + d[v_e] − d[v_w])/2.)

In the northern region, the front of an obstacle is its south side, the back of an obstacle is 
its north side, and the sides of an obstacle are its east and west sides. A wave always hits the 
front of an obstacle first. Consider the shape of a wave before it hits an obstacle and its shape 
after it passes the obstacle. If a peak of the wavefront hits the obstacle (but not at a corner), 
this peak will not be part of the shape of the wave after it “passes” the obstacle. Instead, the 
merged wavefront may have one or two new peaks which have the same x-coordinates as the
sides of the obstacle (see Figure 3.7). The merged wavefront has a valley at the meeting point 


on the boundary of the obstacle. 


Description of the wavefront algorithm


The wavefront algorithm, presented in this section, mimics BFS in that it computes exactly 
the same set of wavefronts. However, in order to minimize relocation costs, the wavefronts 
may be computed in a different order. Rather than computing all the wavefronts at distance t
before computing any wavefronts at distance t + 1 (as BFS does), the wavefront algorithm will
continue to follow a particular wave persistently, before it relocates and pushes another wave 
along. 

We define expanding a wavefront w = (v_1, v_2, ..., v_l) as computing a set of zero or more
successor wavefronts by looking at the set of all unexplored vertices at distance one from any
vertex in w. Every vertex v in a successor wavefront has d[v] = d[w] + 1. The robot starts
at the vertex on one end of the wavefront and moves to all of its unexplored adjacent vertices.
The robot then moves to the next vertex in the wavefront and explores its adjacent unexplored 
vertices. It proceeds this way down the vertices of the wavefront. 


The following lemma shows that a wavefront of l vertices can be expanded in time O(l).


Lemma 10 A robot can expand a wavefront w = (v_1, v_2, ..., v_l) by traversing at most
2(l − 1) + 2⌈l/2⌉ + 4 edges.


Proof: To expand a wavefront w = (v_1, v_2, ..., v_l) the robot needs to move along each vertex in
the wavefront and find all of its unexplored neighbors. This can be done efficiently by moving
along pairs of unexplored edges between vertices in w. These unexplored edges connect l of
the vertices in the successor wavefront. This results in at most 2(l − 1) edge traversals, since
neighboring vertices are at most 2 apart. The successor wavefront might have l + 2 vertices,
and thus at the beginning and the end of the expansion (i.e., at vertices v_1 and v_l), the robot
may have to traverse an edge twice. In addition, at any vertex which is a peak, the robot may
have to traverse an edge twice. Note that a wavefront has at most ⌈l/2⌉ peaks. Thus, the total
number of edge traversals is at most 2(l − 1) + 2⌈l/2⌉ + 4. □
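
To see that the bound of Lemma 10 is linear in l, one can use the fact that the ceiling of l/2 is at most (l + 1)/2. The short derivation below is an added check and not part of the original proof.

```latex
\[
2(l-1) + 2\left\lceil \frac{l}{2} \right\rceil + 4
\;\le\; 2(l-1) + (l+1) + 4
\;=\; 3l + 3
\;=\; O(l).
\]
```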


Since our algorithm computes exactly the same set of wavefronts as BFS, but persistently 
pushes one wave along, it is important to make sure the wavefronts are expanded correctly. 


There is really only one incorrect way to expand a wavefront and get something other than 
what BFS obtained as a successor: to expand a wavefront that is touching a meeting point
before its sibling wavefront has merged with it. Operationally, this means that the wavefront 


algorithm is blocked in the following two situations: 
(a) It cannot expand a wavefront from the side around to the back of an obstacle before 
the meeting point for that obstacle has been set (see Figure 3.8). 
(b) It cannot expand a wavefront that touches a meeting point until its sibling has arrived 
there as well (see Figure 3.9). 


A wavefront w_2 blocks a wavefront w_1 if w_2 must be expanded before w_1 can be safely expanded.
We also say w_2 and w_1 interfere.


Figure 3.8: Blockage of w_1 by w_2. Wavefront w_1 has finished covering one side of the obstacle
and the meeting point is not set yet.


Figure 3.9: Blockage of w_1 by w_2. Wavefront w_1 has reached the meeting point on the obstacle,
but the sibling wavefront w_2 has not.


A wavefront w is an expiring wavefront if its descendant wavefronts can never interfere with 
the expansion of any other wavefronts that now exist or any of their descendants. A wavefront w 
is an expiring wavefront if its endpoints are both on the front of the same obstacle; w will expand 
into the region surrounded by the wavefront and the obstacle, and then disappear or “expire.” 


We say that a wavefront expires if it consists of just one vertex with no unexplored neighbors. 

Figure 3.10: Triangular areas (shaded) delineated by two expiring wavefronts.


Procedure WAVEFRONT-ALGORITHM is an efficient optimally interruptible search algorithm 
that can be used to create an efficient piecemeal learning algorithm. It repeatedly expands one 
wavefront until it splits, merges, expires, or is blocked. The WAVEFRONT-ALGORITHM takes as 
an input a start point s and the boundary coordinates of the environment. It calls procedure 
CREATE-MONOTONE-PATHS to explore four monotone paths (see Section 3.6.1) and define the 


four regions. Then procedure EXPLORE-AREA is called for each region. 


WAVEFRONT-ALGORITHM (s, boundary)
1  create monotone paths
2  for region = north, south, east, and west
3      initialize current wavefront w := (s)
4      EXPLORE-AREA (w, region)
5      take a shortest path to s


For each region we keep an ordered list L of all the wavefronts to be expanded. In the northern
region, the wavefronts are ordered by the x-coordinate of their west-most point. Neighboring
wavefronts are wavefronts that are adjacent in the ordered list L of wavefronts. Note that for
each pair of neighboring wavefronts there is an obstacle on which both wavefronts have an 
endpoint. 

Initially, we expand each wavefront in the northern region from its west-most endpoint to
its east-most endpoint (i.e., we are expanding wavefronts in a “west-to-east” manner). The
direction of expansion changes for the first time in the northern region when a wavefront is
blocked by a wavefront to its west (the direction of expansion then becomes “east-to-west”). In
fact, the direction of expansion changes each time a wavefront is blocked by a wavefront that
is in the direction opposite of expansion. We introduce this notion of expanding wavefronts
in either “west-to-east” or “east-to-west” directions in order to simplify the analysis of the
algorithm.

We treat the boundaries as large obstacles. The north region has been fully explored when 
the list L of wavefronts is empty. Note that vertices on the monotone paths are considered 
initially to be unexplored, and that expanding a wavefront returns a successor that is entirely 
within the same region. 

Each iteration of EXPLORE-AREA expands a wavefront. When EXPAND is called on a wave- 
front w, the robot starts expanding w from its current location, which is a vertex at one of the 
endpoints of wavefront w. It is often convenient, however, to think of EXPAND as finding the 
unexplored neighbors of the vertices in w in parallel. 

Depending on what happens during the expansion, the successor wavefront can be split, 
merged, blocked, or may expire. Note that more than one of these cases may apply. 

Procedures MERGE and SPLIT (see following pages) handle the (not necessarily disjoint) 
cases of merging and splitting wavefronts. Note that we use call-by-reference conventions for 
the wavefront w and the list L of wavefronts (that is, assignments to these variables within 
procedures MERGE and SPLIT affect their values in procedure EXPLORE-AREA). Each time 
procedure RELOCATE(w, dir) is called, the robot moves from its current location to the appro- 
priate endpoint of w: in the northern region, if the direction is “west-to-east” the robot moves 
to the west-most vertex of w, and if the direction is “east-to-west,” the robot moves to the 
east-most vertex of w. 
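
The endpoint rule for RELOCATE in the northern region can be sketched as follows. This is only an illustration: the encoding of a wavefront as a list of (x, y) grid vertices with x increasing eastward, and the function name `relocation_target`, are assumptions and not part of the algorithm's description.

```python
from typing import List, Tuple

Vertex = Tuple[int, int]  # (x, y) with x increasing eastward; an assumed encoding


def relocation_target(wavefront: List[Vertex], direction: str) -> Vertex:
    """Return the endpoint of `wavefront` that RELOCATE moves to (northern region)."""
    if direction == "west-to-east":
        return min(wavefront, key=lambda v: v[0])  # west-most vertex
    if direction == "east-to-west":
        return max(wavefront, key=lambda v: v[0])  # east-most vertex
    raise ValueError(f"unknown direction: {direction!r}")


# Hypothetical wavefront: three vertices at distance 3 from a start point at (0, 0).
w = [(-1, 2), (0, 3), (1, 2)]
west_end = relocation_target(w, "west-to-east")
east_end = relocation_target(w, "east-to-west")
```

The sketch only picks the target vertex; how the robot walks there is the subject of the following paragraphs.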

Procedure RELOCATE(w, dir) can be implemented so that when it is called, the robot simply
moves from its current location to the appropriate endpoint of w via a shortest path
in the explored area of the graph. However, for analysis purposes, we assume that when
RELOCATE(w, dir) is called the robot moves from its current location to the appropriate
endpoint of w as follows.

• When procedure RELOCATE(w_s, dir) is called in line 5 of EXPLORE-AREA, the robot
traverses edges between the vertices in wavefront w_s to get back to the appropriate endpoint
of the newly expanded wavefront.




EXPLORE-AREA (w, region)
1   initialize list of wavefronts L := (w)
2   initialize direction dir := west-to-east
3   Repeat
4       EXPAND current wavefront w to successor wavefront w_s
5       RELOCATE (w_s, dir)
6       current wavefront w := w_s
7       If w is a single vertex with no unexplored neighboring vertices
8       Then
9           remove w from ordered list L of wavefronts
10          If L is not empty
11          Then
12              w := neighboring wavefront of w in direction dir
13              RELOCATE (w, dir)
14      Else
15          replace w by w_s in ordered list L of wavefronts
16          If the second back corner of any obstacle(s)
                has just been explored
17          Then set meeting points for those obstacle(s)
18          If w can be merged with adjacent wavefront(s)
19          Then MERGE (w, L, region, dir)
20          If w hits obstacle(s)
21          Then SPLIT (w, L, region, dir)
22      If L not empty
23      Then
24          If w is blocked by neighboring wavefront w’ in direction
                D ∈ {west-to-east, east-to-west}
25          Then
26              dir := D
27              While w is blocked by neighboring wavefront w’
28              Do
29                  w := w’
30                  RELOCATE (w, dir)
31  Until L is empty

• When procedure RELOCATE(w, dir) is called in line 13 of EXPLORE-AREA, the robot
traverses edges along the boundary of an obstacle.

• When procedure RELOCATE(w, dir) is called in line 9 of MERGE, the robot traverses
edges between vertices in wavefront w to get to the appropriate endpoint of the newly
merged wavefront.

• When procedure RELOCATE(w, dir) is called in line 30 of EXPLORE-AREA, the robot
traverses edges as follows. Suppose the robot is in the northern region and at the west-most
vertex of wavefront w_0, and assume that w is to the east of w_0. Note that both w_0
and w are in the current ordered list of wavefronts L. Thus there is a path between the
robot’s current location and wavefront w which “follows the chain” of wavefronts between
w_0 and w. That is, the robot moves from w_0 to w as follows. Let w_1, w_2, ..., w_n be the
wavefronts in the ordered list of wavefronts between w_0 and w, and let b_0, b_1, ..., b_n
be the obstacles separating wavefronts w_0, w_1, ..., w_n, w (i.e., obstacle b_0 is between w_0
and w_1, obstacle b_1 is between w_1 and w_2, and so on). Then to relocate from w_0 to w, the
robot traverses the edges between vertices of wavefront w_0 to get to the east-most vertex
of w_0, which is on obstacle b_0. Then the robot traverses the edges of the obstacle b_0 to get
to the west-most vertex of w_1, and then the robot traverses the edges between vertices
in wavefront w_1 to get to the east-most vertex of w_1, which is on obstacle b_1. The robot
continues traversing edges in this manner (alternating between traversing wavefronts and
traversing obstacles) until it is at the appropriate end vertex of wavefront w.


MERGE (w, L, region, dir)
1  remove w from list L of wavefronts
2  While there is a neighboring wavefront w’ with which w can merge
3  Do
4      remove w’ from list L of wavefronts
5      merge w and w’ into wavefront w”
6      w := w”
7  put w in ordered list L of wavefronts
8  If w is not blocked
9  Then RELOCATE (w, dir)

Wavefronts are merged when exploration continues around an obstacle. A wavefront can be
merged with two wavefronts, one on each end.

When procedure SPLIT is called on wavefront w, we note that the wavefront is either the
result of calling procedure EXPAND in line 4 of EXPLORE-AREA or the result of calling procedure
MERGE in line 19 of EXPLORE-AREA. Once wavefront w is split into w_0, ..., w_n, we update the
ordered list L of wavefronts, and update the current wavefront.


SPLIT (w, L, region, dir)
1  split w into appropriate wavefronts w_0, ..., w_n in standard order
2  remove w from ordered list L of wavefronts
3  For i := 0 To n
4      put w_i on ordered list L of wavefronts
5  If dir = west-to-east
6  Then w := w_0
7  Else w := w_n
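
The splitting step and the choice of the new current wavefront can be sketched in a simplified model. The sketch assumes that a candidate successor wavefront is given as a west-to-east list of grid positions and that positions occupied by obstacles are simply dropped; the names `split_wavefront` and `choose_current` are hypothetical, and the real procedure also updates the ordered list L.

```python
from typing import List, Set, Tuple

Vertex = Tuple[int, int]  # (x, y) grid coordinates; an assumed encoding


def split_wavefront(wavefront: List[Vertex], blocked: Set[Vertex]) -> List[List[Vertex]]:
    """Split a west-to-east ordered wavefront into maximal runs of free vertices."""
    pieces: List[List[Vertex]] = []
    current: List[Vertex] = []
    for v in wavefront:
        if v in blocked:
            if current:           # an obstacle ends the current piece
                pieces.append(current)
            current = []
        else:
            current.append(v)
    if current:
        pieces.append(current)
    return pieces                 # w_0, ..., w_n in west-to-east order


def choose_current(pieces: List[List[Vertex]], direction: str) -> List[Vertex]:
    """Mirror the last step of SPLIT: w_0 for west-to-east, w_n otherwise."""
    return pieces[0] if direction == "west-to-east" else pieces[-1]


# Hypothetical wavefront at distance 3 from (0, 0); the vertex (0, 3) lies in an obstacle.
w = [(-2, 1), (-1, 2), (0, 3), (1, 2), (2, 1)]
pieces = split_wavefront(w, blocked={(0, 3)})
```

With the example above, `choose_current(pieces, "west-to-east")` selects the western piece, matching the rule that the robot continues with the wavefront on the side it is already expanding from.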


Correctness of the wavefront algorithm 


The following theorems establish the correctness of our algorithm. 


Theorem 8 The algorithm EXPLORE-AREA expands wavefronts so as to maintain optimal
interruptibility.

Proof: This is shown by induction on the distance of the wavefronts. The key observations
are:

• There is a canonical shortest path from any vertex v to s which goes south whenever
possible, but east or west around obstacles.

• A wavefront is never expanded beyond a meeting point.

We show that the algorithm maintains optimal interruptibility by knowing the canonical
shortest path from any explored vertex to the start vertex s. We refer to this as the shortest
path property. We show that the algorithm maintains the shortest path property by induction
on the number of stages in the algorithm. Each stage of the algorithm is an expansion of a
wavefront.

The shortest path property is trivially true when the number of stages k = 1. There is
initially only one wavefront, the start point. Now we assume all wavefronts that exist just after
the k-th stage satisfy the shortest path property, and we want to show that all wavefronts that
exist just after the (k+1)-st stage also satisfy the shortest path property.

Consider a wavefront w in the k-th stage which the algorithm has expanded in the (k+1)-st
stage to w_s. We claim that all vertices in w_s have shortest path length d[w] + 1. Note that
any vertex in w_s which is directly north of a vertex in w definitely has shortest path length
d[w] + 1. This is because there is a shortest path from any vertex v to s which goes south
whenever possible, but if it is not possible to go south because of an obstacle, it goes east or
west around the obstacle.

The only time any vertex v in w_s is not directly north of a vertex in w is when w is expanded
around the back of an obstacle. This can only occur for a vertex that is either the west-most or
east-most vertex of a wavefront in the north region. Without loss of generality we assume that
v is the west-most point on w_s and v is on the boundary of some obstacle b. Note that w is
expanded around the back of an obstacle only when the meeting point is determined. Because 
the algorithm only expands any wavefront until it reaches the meeting point of an obstacle, 
vertex v is not to the west of the meeting point. The algorithm knows that v has a shortest
path from s that goes through a vertex of w and along the obstacle to v. Thus the algorithm
satisfies the shortest path property for the (k+1)-st stage. ∎


Theorem 9 If the region is not completely explored, there is always a wavefront that is not
blocked.

Proof: We consider exploration in the north region. The key observations are:

• Neighboring wavefronts cannot simultaneously block each other.

• The east-most wavefront in the north region cannot be blocked by anything to its east,
and the west-most wavefront in the north region cannot be blocked by anything to its
west.

Thus the robot can always “follow a chain” of wavefronts to either its east or west to find an
unblocked wavefront.

A neighboring wavefront is either a sibling wavefront or an expiring wavefront. An expiring
wavefront can never block neighboring wavefronts. In order to show that neighboring wavefronts 
cannot simultaneously block each other, it thus suffices to show next that sibling wavefronts 
cannot block each other. We use this to show that we can always find a wavefront w# which 
is not blocked. The unblocked wavefront w# nearest in the ordered list of wavefronts L can be
found by “following the chain” of blocked wavefronts from w to w#. By following the chain of 
wavefronts between w and w# we mean that the robot must traverse the edges that connect the 
vertices in each wavefront between w and w# in L and also the edges on the boundaries of the 
obstacles between these wavefronts. Note that neighboring wavefronts in list L each have at 
least one endpoint that lies on the boundary of the same obstacle. 

Before we show that sibling wavefronts cannot block each other we need the following 
terminology. The first time an obstacle is discovered by some wavefront, we call the point that 
the wavefront hits the obstacle the discovery point. (Note that there may be more than one 
such point. We arbitrarily choose one of these points.) In the north region, we split up the 
wavefronts adjacent to each obstacle into an east wave and a west wave. We call the set of all 
these wavefronts which are between the discovery point and the meeting point of the obstacle 
in a west-to-east manner the west wave. We define the east wave of an obstacle analogously. 

The discovery point of an obstacle b is always at the front of b. The wavefront that hits
at b is split into two wavefronts, one of which is in the east wave and one of which is in the
west wave of the obstacle. We claim that a descendant wavefront w_1 in the west wave and
a descendant wavefront w_2 in the east wave cannot simultaneously block each other. Assume
that the algorithm is trying to expand w_1 but that wavefront w_2 blocks w_1. Wavefront w_2 can
only block w_1 if one of the following two cases applies. In both cases, we show that w_1 cannot
also block w_2.


Case 1: Wavefront w_1 is about to expand to the back of obstacle b, but both of the
back corners of obstacle b have not been explored, and thus the meeting point has not
been determined. Wavefront w_2 can only be blocked by w_1 if w_2 is either already at the
meeting point of the obstacle or about to expand to the back of the obstacle. Since none
of the back corners of obstacle b have been explored, neither of these two possibilities
holds. Thus, wavefront w_1 does not block w_2.

Case 2: Wavefront w_1 has reached the meeting point at the back of b. Therefore, both
back corners of the obstacle have been explored and w_1 is not blocking w_2.


We have just shown that if w_2 blocks w_1 then w_1 cannot also block w_2. Thus, the algorithm
tries to pick w_2 as the nearest unblocked wavefront to w_1. However, w_2 may be blocked by its
sibling wavefront w_3 on a different obstacle b’. For this case, we have to show that this sibling
wavefront w_3 is not blocked, or that its sibling wavefront w_4 on yet another obstacle b” is not
blocked and so forth. Without loss of generality, we assume that the wavefronts are blocked
by wavefronts towards the east. Proceeding towards the east along the chain of wavefronts will
eventually lead to a wavefront which is not blocked, namely the east-most wavefront in the
northern region. The east-most wavefront is adjacent to the initial monotone east-north path.
Therefore, it cannot be blocked by a wavefront towards the east.
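
The chain-following argument can be illustrated with a small sketch. It assumes the ordered list of wavefronts is abstracted to a list in which entry i records whether wavefront i is unblocked (`None`) or blocked by its neighbor to the `"east"` or `"west"`; this encoding and the name `find_unblocked` are illustrative assumptions, not the thesis's data structure.

```python
from typing import List, Optional


def find_unblocked(blocked_by: List[Optional[str]], start: int) -> int:
    """Follow the chain of blocking wavefronts from `start` to an unblocked one.

    blocked_by[i] is None, "east", or "west" (west-to-east list order).
    """
    i = start
    for _ in range(len(blocked_by)):
        side = blocked_by[i]
        if side is None:
            return i
        i += 1 if side == "east" else -1
        if not 0 <= i < len(blocked_by):
            # Contradicts the second key observation: end wavefronts are
            # never blocked from outside the region.
            raise ValueError("end wavefront blocked from outside the region")
    # Contradicts the first key observation: neighbors never block each other.
    raise ValueError("chain does not terminate; input has mutual blocking")


chain = ["east", "east", None, "west"]
from_west = find_unblocked(chain, 0)
from_east = find_unblocked(chain, 3)
```

The two error branches correspond to inputs that violate the key observations of the proof; on inputs that satisfy them, the walk never reverses direction and therefore ends at an unblocked wavefront.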


Theorem 10 The wavefront algorithm is an optimally interruptible piecemeal learning
algorithm for city-block graphs.


Proof: To show the correctness of a piecemeal algorithm that uses our wavefront algorithm 
for exploration with interruption, we show that the wavefront algorithm maintains the shortest 
path property and explores the entire environment. 

Theorem 8 shows by induction on shortest path length that the wavefront algorithm mimics 
breadth-first search. Thus it is optimally interruptible. 

Theorem 9 shows that the algorithm does not terminate until all vertices have been explored.
Correctness follows. ∎
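
To make the phrase “mimics breadth-first search” concrete, the sketch below computes BFS layers on a small grid with one obstacle. It is a centralized computation, not the robot's exploration procedure: the grid size, the obstacle, and the function name `bfs_layers` are assumptions chosen for illustration. Each layer is the set of vertices that the wavefronts at that distance cover between them.

```python
from collections import deque
from typing import List, Set, Tuple

Vertex = Tuple[int, int]  # (x, y) grid coordinates; an assumed encoding


def bfs_layers(width: int, height: int, blocked: Set[Vertex], s: Vertex) -> List[List[Vertex]]:
    """Group the free grid vertices by their shortest-path distance from s."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        x, y = queue.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if inside and nxt not in blocked and nxt not in dist:
                dist[nxt] = dist[(x, y)] + 1
                queue.append(nxt)
    layers: List[List[Vertex]] = [[] for _ in range(max(dist.values()) + 1)]
    for v, d in dist.items():
        layers[d].append(v)
    return [sorted(layer) for layer in layers]


# 3x3 grid, single-vertex obstacle at (1, 1), start vertex south of it.
layers = bfs_layers(3, 3, blocked={(1, 1)}, s=(1, 0))
```

In this example the last layer consists of the single vertex (1, 2) behind the obstacle, which is reached at the same distance from both sides and so plays the role of a meeting point.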


Efficiency of the wavefront algorithm 


We now show the number of edges traversed by the piecemeal algorithm based on the wavefront 
algorithm is linear in the number of edges in the city-block graph. 

We first analyze the number of edges traversed by the wavefront algorithm. Note that the
robot traverses edges when procedures CREATE-MONOTONE-PATHS, EXPAND, and RELOCATE
are called. In addition, it traverses edges to get back to s between calls to EXPLORE-AREA.
These are the only times the robot traverses edges. Thus, we count the number of edges
traversed for each of these cases. In Lemmas 11 to 14, we analyze the number of edges traversed
by the robot due to calls of RELOCATE. Theorem 11 uses these lemmas and calculates the total
number of edges traversed by the wavefront algorithm.


Lemma 11 An edge is traversed at most once due to relocations after a wavefront has expired
(RELOCATE in line 13 of EXPLORE-AREA).


Proof: Assume that the robot is in the northern region and expanding wavefronts in a
west-to-east direction. Suppose wavefront w has just expired onto obstacle b (i.e., it is a single
vertex with all of its adjacent edges explored). The robot now must relocate along obstacle b to its
neighboring wavefront w’ to the east. Note that w’ is also adjacent to obstacle b, and therefore
the robot is only traversing edges on the obstacle b.

Note that at this point of exploration, there is no wavefront west of w which will expire
onto obstacle b. This is because expiring wavefronts are never blocked, and thus the direction
of expansion cannot be changed due to an expiring wavefront. So, when a wavefront is split and
the direction of expansion is west-to-east, the robot always chooses the west-most wavefront to
expand first. Thus, the wavefronts which expire onto obstacle b are explored in a west-to-east
manner. Thus relocations after wavefronts have expired on obstacle b continuously move east
along the boundary of this obstacle. ∎


Lemma 12 An edge is traversed at most once due to relocations after wavefronts have merged
(RELOCATE in line 9 of MERGE).


Proof: Before a call to procedure MERGE, the robot is at the appropriate end vertex of
wavefront w. Assume that the robot is in the northern region and expanding wavefronts
in a west-to-east direction. Thus the robot is at the west-most vertex of wavefront w. Note that
wavefront w can be merged with at most two wavefronts, one at each end, but only merges with
the wavefront to the west of w actually cause the robot to relocate. Suppose wavefront w is
merged with wavefront w’ to its west to form wavefront w”. Then, if the resulting wavefront w”
is unblocked, procedure RELOCATE is called and the robot must traverse w” to its west-most
vertex (i.e., also the west-most vertex of w’). However, since wavefront w” is unblocked, w” can
immediately be expanded and is not traversed again. ∎


Lemma 13 At most one wavefront from the east wave of an obstacle is blocked by one or more
wavefronts in the west wave. At most one wavefront from the west wave is blocked by one or
more wavefronts in the east wave.


Proof: Consider the west wave of an obstacle. By the definition of blocking, there are only
two possible wavefronts in the west wave that can be blocked. One wavefront is adjacent to
the back corner of the obstacle. Call this wavefront w_1. The other wavefront is adjacent to the
meeting point of the obstacle. Call this wavefront w_2.

We first show that if w_1 is blocked then w_2 will not also be blocked. Then we also know
that if w_2 is blocked then w_1 must not have been blocked. Thus at most one wavefront in the
west wave is blocked.

If w_1 is blocked by one or more wavefronts in the east wave then these wavefronts can be
expanded to the meeting point of the obstacle without interference from w_1. That is, wavefront
w_1 cannot block any wavefront in the east wave, and thus there will be no traversals around
the boundary of the obstacle until the east wave has reached the meeting point. At this point,
the west wave can be expanded to the meeting point without any wavefronts in the east wave
blocking any wavefronts in the west wave.

Similarly, at most one wavefront from the east wave is blocked by one or more wavefronts
in the west wave. ∎


Lemma 14 An edge is traversed at most three times due to relocation after blockage
(RELOCATE in line 30 of EXPLORE-AREA).


Proof: Without loss of generality, we assume that the wavefronts are blocked by wavefronts 
towards the east. Proceeding towards the east along the chain of wavefronts will eventually 
lead to a wavefront which is not blocked, since the east-most wavefront is adjacent to the initial 
monotone east-north path. 

First we show that any wavefront is traversed at most once due to blockage. Then we show
that the boundary of any obstacle is traversed at most twice due to blockage. Note that pairs
of edges connecting vertices in a wavefront may also be edges which are on the boundaries of
obstacles. Thus any edge is traversed at most three times due to relocation after blockage.

We know from Theorem 9 that there is always a wavefront that is not blocked. Assume that
the robot is at a wavefront w which is blocked by a wavefront to its east. Following the chain of
wavefronts to the east leads to an unblocked wavefront w’. This results in one traversal of the
wavefronts. Now this wavefront w’ is expanded until it is blocked by some wavefront w”. Note
that wavefront w” cannot be to the west of w’, since we know that the wavefront west of w’ is
blocked by w’. (We show in the proof of Theorem 9 that if w_1 blocks w_2 then w_2 does not block
w_1.) The robot will not move to any wavefronts west of wavefront w’ until a descendant of w’
no longer blocks the wavefront immediately to its west. Once this is the case, then the west
wavefront can immediately be expanded. Similarly, we go back through the chain of wavefronts,
since, as the robot proceeds west, it expands each wavefront in the chain. Thus the robot
never traverses any wavefront more than once due to blockage.

Now we consider the number of traversals, due to blockage, of edges on the boundary of 
obstacles. As wavefronts expand, their descendant wavefronts may still be adjacent to the 
same obstacles. Thus, we need to make sure that the edges on the boundaries of obstacles are 
not traversed too often due to relocation because of blockage. We show that any edge on the 
boundary of an obstacle is not traversed more than twice due to relocations because of blockage. 
That is, the robot does not move back and forth between wavefronts on different sides of an 
obstacle. Lemma 13 implies that each edge on the boundary of the obstacle is traversed at 
most twice due to blockage. 

Thus, since the edges on the boundary of an obstacle may be part of the pairs of edges
connecting vertices in a wavefront, the total number of times any edge can be traversed due to
blockage is at most three. ∎


Theorem 11 The wavefront algorithm is linear in the number of edges in the city-block graph. 


Proof: We show that the total number of edge traversals is no more than 15|E|. Note that when 
the procedures CREATE-MONOTONE-PATHS, EXPAND, and RELOCATE are called, the robot 
traverses edges in the environment. In addition, the robot traverses edges in the environment
to get back to s after exploration of each of the four regions. These are the only times the
robot actually traverses edges in the environment. Thus, to calculate the total number of edge
traversals, we count the edge traversals for each of these cases.

The robot traverses the edges on the monotone paths once when it explores them, and once
to get back to the start point. This is clearly at most 2|E| edge traversals. The robot walks
back to s once after exploring each of the four regions, four times in all. Thus the number of
edges traversed here is at most 4|E|. The proof of Lemma 10 implies that the total number of
edge traversals caused by procedure EXPAND is at most 2|E|. We now only need to consider the
edge traversals due to calls to procedure RELOCATE.

Procedure RELOCATE is called four times within EXPLORE-AREA and MERGE. The four calls
are due to expansion (line 5 of EXPLORE-AREA), expiring (line 13 of EXPLORE-AREA), merging
(line 9 of MERGE) and blocking (line 30 of EXPLORE-AREA). Relocations after expanding a
wavefront result in a total of |E| edge traversals. Lemma 11 shows that edges are traversed
at most twice due to expiring wavefronts. Lemma 12 shows that edges are traversed at most
once due to relocations after merges. Finally, Lemma 14 shows that edges are traversed at most
three times due to relocations after blockage. Thus the total number of edge traversals due to
calls of procedure RELOCATE is at most 7|E|.

Thus the total number of edges traversed by the wavefront algorithm is at most 15|E|. A more
careful analysis of the wavefront algorithm can improve the constant factor. ∎
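
The constant in Theorem 11 can be checked by adding up the per-case bounds stated in the proof. The display below only restates those figures; the grouping labels are ours.

```latex
\begin{align*}
\underbrace{2|E|}_{\text{monotone paths}}
 + \underbrace{4|E|}_{\text{returns to } s}
 + \underbrace{2|E|}_{\text{EXPAND}}
 + \underbrace{(1 + 2 + 1 + 3)\,|E|}_{\text{RELOCATE: expansion, expiring, merging, blocking}}
 &= 2|E| + 4|E| + 2|E| + 7|E| \\
 &= 15|E| .
\end{align*}
```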


Theorem 12 A piecemeal algorithm based on the wavefront algorithm runs in time linear in
the number of edges in the city-block graph.

Proof: This follows immediately from Theorem 10 and Theorem 11. ∎


3.6.3 The ray algorithm


We now give another efficient optimally interruptible search algorithm, called the ray algorithm. 
The ray algorithm is a variant of DFS that always knows a shortest path back to s. This thus 
yields another efficient piecemeal algorithm for searching a city-block graph. This algorithm is 
simpler than the wavefront algorithm, but may be less suitable for generalization, because it
appears more specifically oriented towards city-block graphs.

The ray algorithm also starts by finding the four monotone paths, and splitting the graph
into four regions to be searched separately. The algorithm explores in a manner similar to 
depth-first search, with the following exceptions. Assume that it is operating in the northern 
region. The basic operation is to explore a northern-going “ray” as far as possible, and then 
to return to the start point of the ray. Along the way, side-excursions of one-step are made to 
ensure the traversal of east-west edges that touch the ray. Optimal interruptibility will always 
be maintained: the ray algorithm will not traverse a ray until it knows a shortest path to s from 
the base of the ray (and thus a shortest path to s from any point on the ray, by Lemma 6). 

The high-level operation of the ray algorithm is as follows. (See Figure 3.11.) From each 
point on the (horizontal segments of the) monotone paths bordering the northern region, a 
north-going ray is explored. On each such ray, exploration proceeds north until blocked by an 
obstacle or the boundary of the city-block graph. Then the robot backtracks to the beginning 
of the ray and starts exploring a neighboring ray. As described so far, each obstacle creates 
a “shadow region” of unexplored vertices to its north. These shadow regions are explored as 
follows. Once the two back corners of an obstacle are explored, the shortest paths to the vertices 
at the back of an obstacle are then known; the “meeting point” is then determined. Once the 
meeting point for an obstacle is known, the shortest path from s to each vertex on the back 
border of the obstacle is known. The robot can then explore north-going rays starting at each 
vertex at the back border of the obstacle. There may be further obstacles that were all or 
partially in the shadow regions; their shadow regions are handled in the same manner. 
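
The interplay of rays and shadow regions can be sketched on a toy grid. This is a deliberately reduced model: it ignores the one-step side excursions, the meeting point computation, and the monotone-path boundaries, and it assumes a flat base row at y = 0. The function name `explore_rays` and the example grid are illustrative assumptions.

```python
from typing import Iterable, Set, Tuple

Vertex = Tuple[int, int]  # (x, y) with y increasing northward; an assumed encoding


def explore_rays(bases: Iterable[Vertex], blocked: Set[Vertex], height: int) -> Set[Vertex]:
    """Walk north from each base vertex until an obstacle or the boundary stops the ray."""
    reached: Set[Vertex] = set()
    for x, y in bases:
        while y < height and (x, y) not in blocked:
            reached.add((x, y))
            y += 1
    return reached


width, height = 5, 5
blocked = {(1, 2), (2, 2)}  # one rectangular obstacle, two vertices wide
free = {(x, y) for x in range(width) for y in range(height)} - blocked

# First pass: rays from every vertex of the (assumed flat) base row.
first_pass = explore_rays([(x, 0) for x in range(width)], blocked, height)
shadow = free - first_pass  # unexplored vertices north of the obstacle

# Second pass: rays from the back border of the obstacle, which in the real
# algorithm start only once the meeting point is known.
back_border = [(1, 3), (2, 3)]
second_pass = explore_rays(back_border, blocked, height)
```

After the second pass every free vertex has been reached, which is the effect the shadow-region rule is meant to achieve; nested obstacles inside a shadow region would be handled by repeating the same step.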

We note that not all paths to s in the “search tree” defined by the ray algorithm are 
shortest paths; the tree path may go one way around an obstacle while the algorithm knows 
that the shortest path goes the other way around. However, the ray algorithm is nonetheless
an optimally interruptible search algorithm.


Theorem 13 The ray algorithm is a linear-time optimally interruptible search algorithm that
can be transformed into a linear-time piecemeal learning algorithm for city-block graphs.


Proof: This follows from the properties of city-block graphs proved in Section 3.6.1, and the
above discussion. In the ray algorithm each edge is traversed at most a constant number
of times. The linearity of the corresponding piecemeal learning algorithm then follows from
Theorem 7. ∎

Figure 3.11: Operation of the ray algorithm.


3.7 Piecemeal learning of undirected graphs 


For piecemeal learning of arbitrary undirected graphs, we again turn our attention to breadth- 
first search. As we mentioned earlier, standard BFS is efficient only when the robot can
efficiently switch or “teleport” from expanding one vertex to expanding another. In contrast, our 
model assumes a more natural scenario where the robot must physically move from one vertex 
to the next. We change the classical BFS model to a more difficult teleport-free exploration 
model, and give efficient approximate BFS algorithms where the robot does not move much 
further away from s than the distance from s to the unvisited vertex nearest to s. The teleport- 
free BFS algorithms we present never visit a vertex more than twice as far from s as the nearest 
unvisited vertex is from s. 

Our techniques for piecemeal learning of arbitrary undirected graphs are inspired by the work 
of Awerbuch and Gallager [6, 7]. We observe that our learning model bears some similarity to 
the asynchronous distributed model. This similarity is surprising and has not been explored in 
the past. 


Our main theorem for piecemeal learning of arbitrary undirected graphs is: 


Theorem 14 Piecemeal learning of an arbitrary undirected graph G = (V, E) can be done in
time O(E + V^{1+o(1)}).


Proof: Following the RECURSIVE-STRIP algorithm, given in Section 3.7.3, the robot always
knows a path from its current location back to the start vertex of length at most the radius
of the graph. Thus RECURSIVE-STRIP is efficiently interruptible. The running time of this
algorithm is O(E + V · 2^{O(sqrt(log V log log V))}) = O(E + V^{1+o(1)}). By Theorem 7, this
algorithm can be interrupted efficiently to give a piecemeal learning algorithm with running
time O(E + V^{1+o(1)}). ∎
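
The equality of the two bounds in this proof can be spelled out. The derivation assumes that the first bound is read as O(E + V · 2^{O(sqrt(log V log log V))}), which is our reading of the expression, that logarithms are to base 2, and that c > 0 is the constant hidden in the inner O.

```latex
\[
2^{c\sqrt{\log V \,\log\log V}}
  = \left(2^{\log V}\right)^{c\sqrt{\log\log V / \log V}}
  = V^{\,c\sqrt{\log\log V / \log V}}
  = V^{o(1)},
\]
since $\sqrt{\log\log V / \log V} \to 0$ as $V \to \infty$. Hence
\[
O\!\left(E + V \cdot 2^{O\left(\sqrt{\log V \log\log V}\right)}\right)
  = O\!\left(E + V^{1+o(1)}\right).
\]
```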


In the remainder of this section, we give three algorithms for piecemeal learning undirected
graphs. We first give a simple algorithm that runs in O(E + V^{1.5}) time. We then give a
modification of this algorithm that runs in O((E + V^{1.5}) log V) time. Although this algorithm
has slightly slower running time, we are able to make it recursive, giving a third algorithm
with almost linear running time: it achieves O(E + V^{1+o(1)}) running time. The most efficient
previously known algorithm has O(E + V^2) running time.


3.7.1 Algorithm STRIP-EXPLORE 


This section describes an efficiently interruptible algorithm for undirected graphs with running 
time O(E + V^{1.5}). It is based on breadth-first search.

A layer in a BFS tree consists of vertices that have the same shortest path distance to the 
start vertex. A frontier vertex is a vertex that is incident to unexplored edges. A frontier vertex 
is expanded when the robot has traversed all the unexplored edges incident to it. 

The traditional BFS algorithm expands frontier vertices layer by layer. In the teleport-free
model, this algorithm runs in time O(E + rV), since expanding all the vertices takes time
O(E), and visiting all the frontier vertices on layer i can be performed with a depth-first search
of layers 1...i in time O(V), and there are at most r layers. The procedure LOCAL-BFS
describes a version of the traditional BFS procedure that has been modified for our
teleport-free BFS model in two respects. First, the robot does not relocate to vertices that have
no unexplored edges. Second, it only explores vertices within a given distance-bound L of the
given start vertex s. (The first modification, while seemingly straightforward, is essential for
our analysis of STRIP-EXPLORE, which uses LOCAL-BFS as a subroutine.) A procedure call of
the form LOCAL-BFS(s, r), where s is the start vertex of the graph and r is its radius, would
cause the robot to explore the entire graph.


Awerbuch and Gallager [6, 7] give a distributed BFS algorithm which partitions the network
in strips, where each strip is a group of L consecutive layers. (Here L is a parameter to be
chosen.) All vertices in strip i-1 are expanded before any vertices in strip i are expanded.
Their algorithms use as a subroutine breadth-first type searches with distance L.


LOCAL-BFS(s, L)
1  For i := 0 To L-1 Do
2      let verts = all vertices at distance i from s
3      For each u ∈ verts Do
4          If u has any incident unexplored edges
5          Then
6              relocate to u
7              traverse each unexplored edge
8                  incident to u
9  relocate to s


Our algorithm, STRIP-EXPLORE, searches in strips in a new way. See Figure 3.12. The robot
explores the graph in strips of width L. First the robot does LOCAL-BFS(s, L) to explore the
first strip. It then explores the second strip as follows. Suppose there are k frontier vertices
v_1, v_2, ..., v_k in layer L; each such vertex is a source vertex for exploring the second strip. A
naive way of exploring the second strip is for the robot, for each i, to relocate to v_i, and then
find all vertices that are within distance L of v_i by doing a BFS of distance-bound L from v_i
within the second strip. The robot thus traverses a forest of k BFS trees of depth L, completely
exploring the second strip. The robot then has a map of the BFS tree of depth L for the first
strip and a map of the BFS forest for the second strip, enabling it to create a BFS tree of
depth 2L for the first two strips. The robot continues, strip by strip, until the entire graph is
explored.

The naive algorithm described above is inefficient, due to the overlap between the trees in the
forest at a given level, causing portions of each strip to be repeatedly re-explored. The algorithm
STRIP-EXPLORE presented below solves this problem by using the LOCAL-BFS procedure as
the basic subroutine, instead of using a naive BFS. (See Figure 3.12.)

[Figure 3.12 diagram; labels: frontier vertices, source vertices, depth r, global BFS tree,
strip of depth L]

Figure 3.12: In the naive algorithm, the shaded areas are retraversed completely. In
STRIP-EXPLORE, the shaded areas are passed through more than once only if necessary to get to
frontier vertices.


STRIP-EXPLORE(s, L, r)
1  numstrips = ⌈r/L⌉
2  sources = {s}
3  For i = 1 To numstrips Do
4      For each u ∈ sources Do
5          relocate to u
6          LOCAL-BFS(u, L)
7      sources = all frontier vertices


In STRIP-EXPLORE, the robot searches in a breadth-first manner, but ignores previously
explored territory. The only time the robot traverses edges that have been previously explored
is when moving to a frontier vertex it is about to expand. This results in retraversal of some
edges in previously explored territory, but not as many as in the naive algorithm.

Theorem 15 STRIP-EXPLORE runs in $O(E + V^{1.5})$ time.


Proof: First we count edge traversals for relocating between source vertices for a given strip.
For these relocations, the robot can mentally construct a tree in the known graph connecting
these vertices, and then move between source vertices by doing a depth-first traversal of this
tree. Thus the number of edge traversals due to relocations between source vertices for this
strip is at most 2V. Since there are $\lceil r/L \rceil$ strips, the total number of edge traversals due to
relocations between source vertices is at most $\lceil r/L \rceil \cdot 2V \le (r/L + 1) \cdot 2V = 2rV/L + 2V$.
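The 2V figure for one strip comes from the fact that a depth-first walk of a tree crosses every tree edge exactly twice. A small Python sketch of that walk follows; the function name `dfs_tour` and the example tree are illustrative assumptions, not part of the original procedure.

```python
def dfs_tour(tree, root):
    """Return the vertex sequence of a depth-first walk that ends back at root.

    `tree` is an adjacency dictionary of an undirected tree.
    """
    walk = [root]

    def visit(u, parent):
        for w in tree[u]:
            if w != parent:
                walk.append(w)      # step down the tree edge (u, w)
                visit(w, u)
                walk.append(u)      # step back up the same edge

    visit(root, None)
    return walk


tree = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
walk = dfs_tour(tree, 0)
traversals = len(walk) - 1
print(walk, traversals)
```

For a tree on n vertices the walk uses 2(n - 1) edge traversals, which is below the 2V used in the proof whenever the tree has at most V vertices.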


Figure 3.13: Contrasting BFS and LOCAL-BFS: Consider a BFS of depth 5 from $s_1$, followed
by a BFS of depth 5 from $s_2$. (The depth of the strip is L = 5.) The BFS from $s_2$ revisits
vertices a, b, c, d, e. On the other hand, if the BFS from $s_1$ is followed by a LOCAL-BFS from
$s_2$, then it only revisits d, c, e. After edge (f, d) is found, vertex e is a frontier vertex that needs
to be expanded.

Now we count edge traversals for repeatedly executing the LOCAL-BFS algorithm. First,
for the robot to expand all vertices and explore all edges, it traverses 2E edges. Next, each
time the relocate in line 9 of procedure LOCAL-BFS is called, at most L edges are traversed.
To account for relocations in line 6 of procedure LOCAL-BFS, we use the following scheme for
“charging” edge traversals. Say the robot is within a call of the LOCAL-BFS algorithm. It has
just expanded a vertex u and will now relocate to a vertex v to expand it. Vertex v is charged
for the edges traversed to relocate from u to v. (We are only considering relocations within the
same call of the LOCAL-BFS algorithm; relocations between calls of the LOCAL-BFS algorithm
were considered above.) Source vertices are not charged anything. Moreover, the robot can
always relocate from u to v by going from u to the source vertex of the current local BFS, and
then to v, traversing at most 2L edges. Thus, each vertex is charged at most 2L when it is
expanded. LOCAL-BFS never relocates to a vertex v unless it can expand vertex v (i.e., unless
v is adjacent to unexplored edges). Thus, all relocations are charged to the expansion of some
vertex, and the total number of edge traversals due to relocation is at most 2LV.

Thus the total number of edge traversals is at most $2rV/L + 2V + 3LV + 2E$, which is
$O(rV/L + LV + E)$. When L is chosen to be $\sqrt{r}$, this gives $O(E + V^{1.5})$ edge traversals. ∎
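As a worked check of this last step, the bound $2rV/L + 2V + 3LV + 2E$ can be evaluated at $L = \sqrt{r}$. The only extra fact used is that the radius of a graph is smaller than its number of vertices, so $r \le V$.

```latex
\begin{align*}
\left.\frac{2rV}{L} + 2V + 3LV + 2E \,\right|_{L=\sqrt{r}}
  &= 2\sqrt{r}\,V + 2V + 3\sqrt{r}\,V + 2E \\
  &= 5\sqrt{r}\,V + 2V + 2E \\
  &\le 5V^{1.5} + 2V + 2E \;=\; O(E + V^{1.5}).
\end{align*}
```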


Procedure STRIP-EXPLORE, and the generalizations of it given in later sections, maintain
that $\Delta \le 2\delta$ at all times; that is, the robot never visits a vertex more than twice as far from
s as the nearest unvisited vertex is from s. The worst case is while exploring the second strip.


3.7.2 Iterative strip algorithm


We now describe ITERATIVE-STRIP, an algorithm similar to the STRIP-EXPLORE algorithm.
It is an efficiently interruptible algorithm for undirected graphs inspired by Awerbuch and
Gallager’s [6] distributed iterative BFS algorithm. Although its running time of
$O((E + V^{1.5}) \log V)$ is worse than the running time of STRIP-EXPLORE, its recursive version
(described in Section 3.7.3) is more efficient. (It is not clear how to recursively implement
STRIP-EXPLORE as efficiently, because the trees in a strip are not disjoint.)

With ITERATIVE-STRIP, the robot grows a global BFS tree with root s strip by strip, in a
manner similar to STRIP-EXPLORE. Unlike STRIP-EXPLORE, here each strip is processed several
times before it has correctly deepened the BFS tree by $\sqrt{r}$. We next explain the algorithm’s
behavior on a typical strip by describing how a strip is processed for the first time, and then
for the remaining iterations.


ITERATIVE-STRIP(s, r)
1   For i = 1 To √r Do
2       For each source vertex u in strip i Do
3           relocate to u
4           BFS from u to depth √r, but do not enter previously
                explored territory
5       While there are any active connected components Iterate
6           For each active connected component c Do
7               Repeat
8                   let v1, v2, v3, ... be active frontier vertices
                        exclusively in c with smallest depth among
                        active frontier vertices in c
9                   relocate to each of v1, v2, v3, ..., and expand
10              Until no more active frontier vertices exclusively in c
11          determine new and active connected components


In the first iteration, a strip is explored much as in STRIP-EXPLORE. The robot explores
a tree of depth $\sqrt{r}$ from each source vertex, by exploring in breadth-first manner from each
source vertex, without re-exploring previous trees. Whenever the robot finds a collision edge
connecting the current tree to another tree in the same strip, it does not enter the other tree.
Unlike STRIP-EXPLORE, the robot does not traverse explored edges to get to the active frontier
vertices on other trees. Therefore, after the first iteration, the trees explored are approximate
BFS trees that may have frontier vertices with depth less than $\sqrt{r}$ from some source vertex.
These vertices become active frontier vertices for the next iteration. Thus, the current strip
may not yet extend the global BFS tree by depth $\sqrt{r}$, so more iterations are needed until all
frontier vertices are inactive and the global BFS tree is extended by depth $\sqrt{r}$ (see Figure 3.14).


[Figure 3.14 diagram; labels: current strip, active frontier vertices, global BFS tree]

Figure 3.14: The iterative strip algorithm after the first iteration on the fourth strip. Two
connected components $c_1, c_2$ have been explored. The collision edges $e_1$ and $e_2$ connect the
first three approximate BFS trees. The dashed line shows how source vertices $s_1, s_2, s_3$ connect
within the strip. There are three active frontier vertices with depth less than $D + \sqrt{r}$.


In the second iteration (see Figure 3.15), the robot uses the property that two trees connected 
by a collision edge form a connected component within the strip. (The graph to be explored is 
connected, and thus forms one connected component; but we refer to connected components of 
the explored portion of the graph contained within the strip.) The robot need not traverse any 
edges outside the current strip to relocate between these active frontier vertices in the same 
connected component. In the second and later iterations, the robot works on one connected 
component at a time. 
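The bookkeeping behind "connected components within the strip" can be sketched with a union-find structure over the approximate BFS trees of the strip, merging two trees whenever a collision edge joins them. This is a hypothetical illustration: trees are identified by integer indices and collision edges by pairs of tree indices, neither of which is prescribed by the text.

```python
def strip_components(num_trees, collision_edges):
    """Group trees of one strip into components joined by collision edges."""
    parent = list(range(num_trees))

    def find(x):
        # Follow parent pointers to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in collision_edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for t in range(num_trees):
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())


# Four trees; two collision edges link the first three, as in Figure 3.14.
print(strip_components(4, [(0, 1), (1, 2)]))
```

With the two collision edges of the example, the first three trees form one component and the fourth tree remains a component of its own.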

The robot explores active frontier vertices in one connected component as follows. It com- 
putes (mentally) a spanning tree of the vertices in the current strip. This spanning tree lies 
within the strip. Let d be the least depth of any active frontier vertex in the component from a 
source vertex. It visits the vertices in the strip in an order determined by a DFS of the spanning 
tree. As it visits active frontier vertices of depth d, it expands them. It then recomputes the 
spanning tree (since the component may now have new vertices) and again traverses the tree, 
expanding vertices of the appropriate next depth d’. Traversing a collision edge does not add
the new vertex to the tree, since this vertex has been explored before. This process continues
(at most $\sqrt{r}$ times) until no active frontier vertex in the connected component has distance less
than $\sqrt{r}$ from some source vertex in the component.

[Figure 3.15 diagram; labels: global BFS tree, depth $D + \sqrt{r}$]

Figure 3.15: The iterative strip algorithm after the second iteration. Now the circled vertices
which were active frontier vertices at the beginning of the iteration are expanded. One of the
expansions resulted in a collision edge. Now the strip consists of only one connected component
(shaded area). There are six frontier vertices which become source vertices of the next strip.
All frontier vertices have depth $D + \sqrt{r}$.

The robot handles each connected component in turn, as described above. In the next
iteration it combines the components now connected by collision edges, and explores the new
active frontier vertices in these combined components. Lemma 15 states that at most log V
iterations cause all frontier vertices to become inactive. That is, all frontier vertices are at depth
$\sqrt{r}$ from the source vertices of this strip. These frontier vertices are the new sources for the
next strip.


Lemma 15 At most log V iterations per strip are needed to explore a strip and extend the
global BFS tree by depth $\sqrt{r}$.


Proof: If there are initially l source vertices, then after the first iteration there are at most l
connected components. If a component does not collide with another active component, then
it will have no active frontier vertices for the next iteration. The only active components in
the next iteration are those that have collided with other components, and thus, each iteration
halves the number of components with active frontier vertices. After at most log V iterations
there is no connected component with active frontier vertices left. The robot then has a complete
map of the current strip and of the global BFS tree built in previous strips, so it can combine
this information and extend the global BFS tree by depth $\sqrt{r}$. ∎

Theorem 16 ITERATIVE-STRIP runs in time $O((E + V^{1.5}) \log V)$.


Proof: We first count the number of edge traversals within a strip. Let $V_i$ and $E_i$ be the
number of vertices and edges explored in strip i. For each component, vertices of distance $\ell$
from some source vertex are expanded by computing a spanning tree of the component, doing
a DFS of the spanning tree, and expanding all vertices of distance $\ell$ from some source vertex
(line 9). At each iteration (line 5), components are disjoint, so relocating to all vertices in the
strip of distance exactly $\ell$ takes at most $O(V_i)$ edge traversals. Thus, in one iteration, relocating
to all vertices in the strip within distance $\sqrt{r}$ takes at most $O(\sqrt{r}\,V_i)$ edge traversals. Moreover,
note that in order for the robot to expand each vertex, it traverses at most $O(E_i)$ edges. Thus,
the total number of edge traversals for strip i in one iteration is $O(E_i + \sqrt{r}\,V_i)$. Combining this
with Lemma 15, completely exploring strip i takes $O((E_i + \sqrt{r}\,V_i) \log V)$ edge traversals
within the strip.

Now we count edge traversals for relocating between source vertices in strip i. As in the
proof of Theorem 15, in each iteration the robot traverses at most 2V edges to relocate between
source vertices. Since there are at most log V iterations, this results in $2V \log V$ edge traversals
between source vertices to explore strip i. Thus, the total number of edge traversals to explore
strip i is $O((E_i + \sqrt{r}\,V_i) \log V + 2V \log V)$. Summing over the $\sqrt{r}$ disjoint strips gives
$O((E + \sqrt{r}\,V) \log V + 2V\sqrt{r} \log V) = O((E + \sqrt{r}\,V) \log V) = O((E + V^{1.5}) \log V)$. ∎


3.7.3 A nearly linear time algorithm for undirected graphs


This section describes an efficiently interruptible algorithm RECURSIVE-STRIP, which gives a
piecemeal learning algorithm with running time $O(E + V^{1+o(1)})$. RECURSIVE-STRIP is the
recursive version of ITERATIVE-STRIP; it provides a recursive structure that coordinates the
exploration of strips, of approximate BFS trees, and of connected components in a different
manner. The robot still, however, builds a global BFS tree from start vertex s strip by strip.
The robot expands vertices at the bottom level of recursion.

In RECURSIVE-STRIP, the depth of each strip depends on the level of recursion (see
Figure 3.16). If there are k levels of recursion, then the algorithm starts at the top level by splitting
the exploration of G into $r/d_{k-1}$ strips of depth $d_{k-1}$. Each of these strips is split into
$d_{k-1}/d_{k-2}$ searches of strips of depth $d_{k-2}$, etc. We have
$r = d_k > d_{k-1} > \cdots > d_1 > d_0 = 1$.

Figure 3.16: The recursive strip algorithm processing an approximate BFS tree from source
vertex $s_1$ to depth $d_{k-1} = L$. Recursive calls within the tree are of depth $d_{k-2} = L'$.

Each recursive call of the algorithm is passed a set of source vertices sources, the depth to
which it must explore, and a set T of all vertices in the strip already known to be less than
distance depth from one of the sources. The robot traverses all edges and visits all vertices
within distance depth of the sources that have not yet been processed by other recursive calls
at this level. RECURSIVE-STRIP({s}, r, {s}) is called to explore the entire graph.

At recursion level i, the algorithm divides the exploration into strips and processes each strip
in turn, as follows. Suppose the strip has l source vertices $v_1, \ldots, v_l$. The strip is processed in
at most $\log l = O(\log V)$ iterations. In each iteration, the algorithm partitions T into maximal
sets $T_1, T_2, \ldots, T_k$ such that each set is known to be connected within the strip. Let $S_c$ denote
the set of source vertices in $T_c$. A DFS of the spanning tree of the vertices T gives an order for
the source vertices in $S_1, S_2, \ldots, S_k$; this spanning tree is used for efficient relocations between
these source vertices. Note that all source vertices are known to be connected through the
spanning tree of the vertices in T, but they might not be connected within the substrips. Since
relocations between the vertices in $S_c$ in the next level of recursion use a spanning tree of $T_c$,
for efficiency the vertices of $T_c$ must be connected within the strip. After partitioning the
vertices into connected components within the strip, for each connected component $T_c$ the
robot relocates (along a spanning tree) to some arbitrary source vertex in $S_c$. It then calls the
algorithm recursively with $S_c$, the depth of the strip, and the vertices $T_c$ which are connected
to the sources $S_c$ within the strip.
RECURSIVE-STRIP(sources, depth, T)
1   If depth = 1
2       Then
3           let v_1, v_2, ..., v_k be the depth-first ordering of sources
                in spanning tree
4           For i = 1 To k Do
5               relocate to v_i
6               If v_i has adjacent unexplored edges
7                   Then traverse v_i’s incident edges
8                   T = T ∪ {newly discovered vertices}
9           Return
10      Else
11          determine next-depth
12          number-of-strips ← depth/next-depth
13          For i = 1 To number-of-strips Do
14              determine set of source vertices
15              For j = 1 To number-of-iterations Do
16                  partition vertices in T into maximal sets T_1, T_2, ..., T_k
                        such that vertices in each T_c are known to be
                        connected within strip i
17                  For each T_c in suitable order Do
18                      let S_c be the source vertices in T_c
19                      relocate to some source s ∈ S_c
20                      RECURSIVE-STRIP(S_c, next-depth, T_c)
21                      T = T ∪ T_c
22          relocate to some s ∈ sources
23          Return

The remaining iterations in the strip combine the connected components until the strip is
finished. Then the robot continues with the next strip in the same level of recursion. Or, if it
finished the last strip, it relocates to its starting position and returns to the next higher level
of recursion.

Theorem 17 RECURSIVE-STRIP runs in time $O(E + V^{1+o(1)})$.


Proof: At a particular call of RECURSIVE-STRIP, there are 4 places the robot traverses edges: 
1. expansion of vertices in line 7 
2. relocating to sources in lines 5 and 19 
3. relocations due to recursive calls in line 20 
4. relocation back to a beginning source vertex in line 22 


We count edge traversals for each of these cases. First we give some notation. We consider
the top level of recursion to be a level-k recursive call, and the bottom level of recursion to
be a level-0 recursive call. For a particular level-i call of RECURSIVE-STRIP, let $C_i$ denote the
number of edge traversals due to relocations, and let $E_i$ denote the number of distinct edges
that are traversed due to relocation. Let $V_i$ denote the number of vertices incident to these
edges and whose incident edges are all known at the end of this call. Let $p_i$ be a uniform upper
bound on $C_i/V_i$. Thus, if the depth of recursion is k then the total number of edge traversals
is bounded by $O(V p_k)$.

First we observe that each vertex is expanded at most once, so there are at most $O(E + V)$
edge traversals due to exploration at line 7 in the bottom level of recursion.

For a level-i call, we count the number of edge traversals for relocation between source
vertices (lines 5 and 19). Since all the source vertices in the call are connected by a tree of
size $O(V_i)$, relocating to all source vertices at the start of one strip takes $O(V_i)$ edge traversals.
With $d_i/d_{i-1}$ strips and log V iterations per strip, there are $V_i \frac{d_i}{d_{i-1}} \log V$ edge traversals for
relocations between source vertices.

We now count traversals for recursive calls (line 20) within a level-i call. Note that our
algorithm avoids re-exploring previously explored edges. Thus, for a level-i call, when working
on a particular strip l, for each iteration within this strip, the sets of vertices whose edges are
explored in each recursive call are disjoint. Suppose that, in this strip, in one iteration the
procedure makes k recursive calls, each at level $i-1$. Then let $C_{i-1}^{(j)}$, $1 \le j \le k$, denote the
number of edge traversals due to relocations resulting from the j-th recursive call, and let $V_{i-1}^{(j)}$
denote the number of vertices adjacent to these edges. Furthermore, let $V_{i,l}$ denote the number
of vertices which are in strip l of this procedure call at recursion level i. Then we would like first
to calculate $\sum_{j=1}^{k} C_{i-1}^{(j)}$, which is the number of edge traversals due to relocation in recursive
calls in one iteration within this strip. This is at most
$\sum_{j=1}^{k} p_{i-1} V_{i-1}^{(j)} = p_{i-1} \sum_{j=1}^{k} V_{i-1}^{(j)}$. Since
the recursive calls are disjoint, $\sum_{j=1}^{k} V_{i-1}^{(j)} = V_{i,l}$, and thus the number of edge traversals due to
relocations in recursive calls in one iteration within this strip is at most $p_{i-1} V_{i,l}$. Finally, since
there are log V iterations in each strip, and all strips are disjoint from each other, the number
of edge traversals due to recursive calls is at most $p_{i-1} V_i \log V$.

Finally, note that we relocate once at the end of each procedure call of RECURSIVE-STRIP
(see line 22). This results in at most $V_i$ edge traversals.

Thus, the number of edge traversals due to relocation (not including relocations for expanding
vertices) is described by the recurrence
$C_i \le V_i \frac{d_i}{d_{i-1}} \log V + p_{i-1} V_i \log V + V_i$. Normalizing
by $V_i$, we get the following recurrence:

$$p_i = \left(\frac{d_i}{d_{i-1}} + p_{i-1}\right) \log V + O(1).$$

Solving the recurrence for $p_k$ gives:

$$\begin{aligned}
p_k &\le \left(\frac{d_k}{d_{k-1}}\right) \log V + \left(\frac{d_{k-1}}{d_{k-2}}\right) \log^2 V + \cdots
        + \left(\frac{d_1}{d_0}\right) \log^k V + p_0 \log^k V + \sum_{i=0}^{k-1} \log^i V \\
    &\le \left(\frac{d_k}{d_{k-1}}\right) \log V + \left(\frac{d_{k-1}}{d_{k-2}}\right) \log^2 V + \cdots
        + \left(\frac{d_1}{d_0}\right) \log^k V + O(\log^k V).
\end{aligned}$$

We note that $p_0 = O(1)$, since at the bottom level, if there are $V'$ vertices expanded, then the
number of edge traversals due to relocation is $O(V')$. The product of the first k terms in the
recurrence is $r (\log V)^{k(k+1)/2} = r \log^{k(k+1)/2} V$. We choose $d_{k-1}, d_{k-2}, \ldots$ by setting each of
the first k terms equal to the k-th root of this product. (Note that this also specifies how to
calculate depth $d_{i-1}$ from depth $d_i$.) Substituting, we get:

$$p_k \le k\, r^{1/k} (\log V)^{(k+1)/2} + O(\log^k V).$$

We find the value of k that minimizes this by taking the logarithm and differentiating with
respect to k. Choosing $k = \left(\frac{\log r}{\log \log V}\right)^{1/2}$ and simplifying gives us
$p_k \le 2^{O(\sqrt{\log V \log \log V})}$, and
thus $C_k$ is at most $V\, 2^{O(\sqrt{\log V \log \log V})}$, which is $V^{1+o(1)}$. Adding the edge traversals for relocation
to the edge traversals for expansion of vertices gives us $O(E + V^{1+o(1)})$ edge traversals total. ∎
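The rule for choosing the strip depths can be made concrete with a short numeric sketch. It assumes logarithms to base 2 and treats the depths as real numbers: the j-th term of the bound, $(d_{k-j+1}/d_{k-j}) \log^j V$, is set equal to the k-th root of the product $r (\log V)^{k(k+1)/2}$, which determines $d_{k-j}$ from $d_{k-j+1}$. The function name `depth_schedule` is hypothetical, and rounding the depths to integers is left open because the text does not specify it.

```python
import math


def depth_schedule(r, V, k):
    """Return [d_0, d_1, ..., d_k] with every term of the bound equalized.

    Assumes log base 2 and real-valued depths; a sketch, not the thesis code.
    """
    log_v = math.log2(V)
    # k-th root of r * (log V)^(k(k+1)/2)
    target = r ** (1.0 / k) * log_v ** ((k + 1) / 2.0)
    depths = [0.0] * (k + 1)
    depths[k] = float(r)
    for j in range(1, k + 1):
        # Solve (d_{k-j+1} / d_{k-j}) * log^j V = target for d_{k-j}.
        depths[k - j] = depths[k - j + 1] * log_v ** j / target
    return depths


d = depth_schedule(1024, 2 ** 20, 3)
print([round(x, 3) for x in d])
```

The schedule always ends at $d_0 = 1$, because the product of the k equalized terms is exactly $r (\log V)^{k(k+1)/2}$. For small inputs such as this one the intermediate real values need not be decreasing, so any implementation would have to round and clamp them.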


3.8 An Application to Treasure Hunting 


We now consider an application of our algorithms to the problem of finding a treasure (or a lost
child, or a particular landmark) in an unknown, potentially infinite graph G = (V, E). If the
robot searching for the treasure knows that the treasure is close to its start location, it should
explore in a manner such that it does not get too far away from this location.

We give the procedure TREASURE-SEARCH, which uses the RECURSIVE-STRIP algorithm as
a subroutine. If the treasure is distance $\delta_T$ away from the source vertex, this algorithm maintains
the condition that the robot is never further from the source than $\Delta$, where $\Delta \le \delta_T + o(\delta_T)$.
Following procedure TREASURE-SEARCH, the robot traverses $O(E + V^{1+o(1)})$ edges, where E
and V are the total number of distinct edges and vertices within radius $\Delta$ from the source.

The robot explores the graph for the treasure in phases. In each phase, the size of the strip to
be explored changes. The change at phase i depends on $\epsilon_i = 1/\sqrt{i}$. Initially, the robot explores
the graph out to distance $r_1 = 1 + \epsilon_1$. Next, the robot extends its exploration by a factor of
$1 + \epsilon_2$. That is, the size of the next strip is $(1+\epsilon_1)(1+\epsilon_2) - (1+\epsilon_1)$, and at the end of the second
phase, the robot has learned the graph out to distance $r_2 = (1+\epsilon_1)(1+\epsilon_2)$. After extending
the next strip, the robot has learned the graph out to distance $r_3 = (1+\epsilon_1)(1+\epsilon_2)(1+\epsilon_3)$,
and so on. In each phase i, the robot initially calls RECURSIVE-STRIP from each of the source
vertices (vertices at distance $r_{i-1}$). When the robot finds collision edges, it does not re-explore
edges. Thus, within each phase, it may take up to log V iterations (as in ITERATIVE-STRIP and
RECURSIVE-STRIP) before it has explored the entire strip.
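The growth of the explored radius can be tabulated with a few lines of Python. The sketch assumes the phase parameter is $\epsilon_i = 1/\sqrt{i}$ with $r_0 = 1$, keeps the radii as real numbers, and stops at the first phase whose radius reaches the treasure depth; the function name `phase_radii` and the parameter `delta_t` (standing for $\delta_T$) are illustrative.

```python
import math


def phase_radii(delta_t):
    """Radii r_1, r_2, ... up to the first phase that reaches depth delta_t."""
    radii = []
    r = 1.0                                   # r_0 = 1
    i = 0
    while True:
        i += 1
        r *= 1.0 + 1.0 / math.sqrt(i)         # r_i = r_{i-1} * (1 + eps_i)
        radii.append(r)
        if r >= delta_t:
            return radii


for delta_t in (2, 10, 1000):
    radii = phase_radii(delta_t)
    print(delta_t, len(radii), round(radii[-1], 2))
```

For each tested depth the number of phases lies between $\log_2 \delta_T$ and $4 \ln^2 \delta_T + 1$, and the final radius exceeds $\delta_T$ by no more than $\delta_T / \sqrt{\log_2 \delta_T}$, in line with the bounds stated for TREASURE-SEARCH.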
TREASURE-SEARCH(s)
1   i = 0
2   r_0 = 1
3   Do until treasure is found
4       i = i + 1
5       ε_i = 1/√i
6       r_i = r_{i-1} · (1 + ε_i)
7       If i = 1
8           Then
9               RECURSIVE-STRIP({s}, r_1, {s})
10          Else
11              let T be the set of source vertices distance
                    r_{i-1} away from s
12              For j = 1 To number-of-iterations Do
13                  partition vertices in T into maximal sets T_1, ..., T_k
                        such that vertices in each T_c are known to be
                        connected within strip i
14                  For each T_c in suitable order Do
15                      let S_c be the source vertices in T_c
16                      relocate to some source s ∈ S_c
17                      RECURSIVE-STRIP(S_c, next-depth, T_c)
18                      T = T ∪ T_c


Lemmas 16 and 17 bound the number of phases in the TREASURE-SEARCH procedure. Using
Lemma 16, we can show that the robot does not get too far away from the source vertex, and
using Lemma 17, we can bound the number of edges the robot traverses.

Lemma 16 The number of phases in TREASURE-SEARCH is at least $\log \delta_T$.


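Proof: Since $\epsilon_1 > \epsilon_2 > \epsilon_3 > \cdots$, we know that, for any j,
$(1+\epsilon_1)(1+\epsilon_2) \cdots (1+\epsilon_j) \le (1+\epsilon_1)^j$.
Thus, if we let j be the smallest number such that $(1+\epsilon_1)^j \ge \delta_T$, then we know that the
number of phases i to reach the treasure at $\delta_T$ is at least j. Since $\epsilon_1 = 1$, we have $2^j \ge \delta_T$, or
$j \ge \log \delta_T$. ∎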


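Lemma 17 The number of phases in TREASURE-SEARCH is at most $4 \ln^2 \delta_T + 1$.

Proof: A treasure at depth $\delta_T = 1$ is found in the first phase, so we consider only $\delta_T > 1$. We
know that for any j, $(1+\epsilon_j)^j \le (1+\epsilon_1)(1+\epsilon_2) \cdots (1+\epsilon_j)$. Thus, if $(1+\epsilon_j)^j \ge \delta_T$, we know that the
number of phases i is at most j. So we show the lemma by showing that
$(1+\epsilon_{4 \ln^2 \delta_T})^{4 \ln^2 \delta_T} \ge \delta_T$.
Equivalently, we would like to show

$$4 \ln^2 \delta_T \, \ln(1 + \epsilon_{4 \ln^2 \delta_T}) = 4 \ln^2 \delta_T \, \ln\left(1 + \frac{1}{2 \ln \delta_T}\right) \ge \ln \delta_T.$$

For $|a| < 1$, using a Taylor expansion, we have
$\ln(1+a) = a - \frac{a^2}{2} + \frac{a^3}{3} - \frac{a^4}{4} + \cdots$. For $0 < a < 1$,
we have $\ln(1+a) \ge a - \frac{a^2}{2}$. So

$$4 \ln^2 \delta_T \, \ln\left(1 + \frac{1}{2 \ln \delta_T}\right) \ge (4 \ln^2 \delta_T)\left(\frac{1}{2 \ln \delta_T} - \frac{1}{8 \ln^2 \delta_T}\right) = 2 \ln \delta_T - 1/2,$$

which is at least $\ln \delta_T$ for $\delta_T \ge 2$. ∎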


Theorem 18 The robot is never further than $\delta_T + \delta_T/\sqrt{\log \delta_T}$ from the source vertex.


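Proof: Let $\Delta$ be the furthest distance the robot gets from the source vertex. Let i be the
number of phases that need to be explored to get out to depth $\delta_T$. Then, $\Delta - \delta_T$ is at most
the depth of the strip in the i-th phase. That is,
$\Delta - \delta_T \le (1+\epsilon_1)(1+\epsilon_2) \cdots (1+\epsilon_i) - (1+\epsilon_1)(1+\epsilon_2) \cdots (1+\epsilon_{i-1})
= (1+\epsilon_1)(1+\epsilon_2) \cdots (1+\epsilon_{i-1})\,\epsilon_i \le \delta_T \epsilon_i$. Lemma 16 shows that
the total number of strips explored is at least $\log \delta_T$. Thus, $\epsilon_i$ is at most $1/\sqrt{\log \delta_T}$, and
$\Delta \le \delta_T + \delta_T/\sqrt{\log \delta_T} = \delta_T + o(\delta_T)$. ∎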


Theorem 19 Procedure TREASURE-SEARCH traverses at most $O(E + V^{1+o(1)})$ edges, where E
and V are the total number of distinct edges and vertices within radius $\Delta$ from the source.


Proof: Since the edges in the different phases are disjoint, the number of edges traversed,
ignoring relocations between source vertices in line 16, is at most $O(E + V^{1+o(1)})$. To get
between source vertices in line 16, a spanning tree of the known vertices can be used. (Note
that for recursive calls of RECURSIVE-STRIP, the algorithm relocates between source vertices
using the vertices connected within the appropriate strip.) By Lemma 17, we know the number
of phases is at most $4 \ln^2 \delta_T + 1$, and in each phase it may take up to log V iterations to explore
the entire strip. Thus there are an additional $(4 \ln^2 \delta_T + 1) V \log V$ edge traversals due to
relocations between source vertices, and this gives a total of $O(E + V^{1+o(1)})$ edge traversals for
the entire TREASURE-SEARCH procedure. ∎


3.9 Conclusions


We have presented an efficient $O(E + V^{1+o(1)})$ algorithm for piecemeal learning of arbitrary,
undirected graphs. For the special case of city-block graphs, we have given two linear time
algorithms. We leave as open problems finding linear time algorithms (if they exist) for the
piecemeal learning of:

• grid graphs with non-convex obstacles,
• other tessellations, such as triangular tessellations with triangular obstacles,
• more general classes of graphs, such as the class of planar graphs, and
• arbitrary, undirected graphs.


CHAPTER 4 


Learning-based algorithms for 
protein motif recognition 


4.1 Introduction 


One of the most important problems in computational biology is that of predicting how a 
protein will fold in three dimensions when we only have access to its one-dimensional amino acid 
sequence. Structure prediction has practical importance, as the biological function of a protein 
depends upon its structure or fold. Unfortunately, determining the three-dimensional structure 
of a protein is very difficult. Experimental approaches such as NMR and X-ray crystallography 
are expensive and time-consuming (they can take years), and often do not work at all. Therefore, 
computational techniques that predict protein structure based on already available sequence 
data can help speed up the understanding of protein functions. 

An important first step in tackling the protein folding problem is a solution to the structural
motif recognition problem: given a known local three-dimensional structure, or motif, determine
whether this motif occurs in a given amino acid sequence, and if so, in what positions. In this
chapter, we focus on a special type of α-helical motif, known as the coiled coil motif (see
Section 4.2), although the techniques presented can be applied to other motifs as well.

Most approaches to the motif recognition problem work only for motifs that are already
well-studied; that is, they are known to occur in many sufficiently diverse proteins. This knowledge
usually comes from biologists who have studied many examples of the motif. However, there
are many motifs for which only a small subset of examples is known, and this subset is often
not rich enough to be representative of the motif. Thus, for lack of data, current prediction
methods, ranging from straightforward sequence alignments to more complicated methods such
as those based on profiles of the motifs, often fail to identify such motifs.

For example, in the case of the coiled coil motif, most known instances are 2-stranded coiled
coils (i.e., coiled coils consisting of 2 α-helices). As a result, known prediction algorithms work
well for predicting 2-stranded coiled coils [14, 13, 12, 42, 58, 63], but do not work as well for the
related 3-stranded coiled coil motif (i.e., coiled coils consisting of 3 α-helices), due to the lack
of known 3-stranded coiled coil sequences. That is, for 3-stranded coiled coils, these algorithms
have a large amount of overlap between the scores for sequences that do not contain coiled coils
and sequences that do.


Our results 


In this chapter, we use learning theory to improve existing methods for protein structural motif 
recognition, particularly in the case where only a few examples of the motif are known. Our 
main result is a linear-time learning algorithm that uses information obtained from a database 
of sequences of one motif to make predictions about a related or similar motif. 

The problem we explore can be viewed as a concept learning problem, where the algorithm 
is given labeled and unlabeled examples, and its goal is to find a concept which gives labels 
to all the examples. Unlike many concept learning frameworks, this problem is not completely 
supervised—this type of learning, which we refer to as semi-supervised learning, is often neces- 
sary in real-life learning problems. We find this to be true in our test domain, where our goal is 
to identify sequences that contain coiled coils from a set of protein sequences which may or may 
not contain coiled coils. In particular, we are interested in recognizing both 2- and 3-stranded 
coiled coils. Unfortunately, the majority of the data we have consists of 2-stranded coiled coils.
In addition, although many biologists are interested in 3-stranded coiled coils, there is little
well-analyzed data available on them. Thus, because of the lack of data and current biological
knowledge, supervised learning (i.e., the algorithm is given a large enough set of examples of
both 2- and 3-stranded coiled coils on which to train) is not currently feasible for our problem,
and semi-supervised or even unsupervised learning (with no labeled examples) is the only type
of learning which is possible. At first glance, this learning problem seems challenging,
since we are trying to come up with an algorithm which generalizes the data we have
for 2-stranded coiled coils to also pick out 3-stranded coiled coils. However, we show empirically
that for our test domain, semi-supervised learning gives excellent results. In particular, we have
tested our program and show that our algorithm’s performance is substantially better than that
of previously known algorithms for recognizing 3-stranded coiled coils.

Our algorithm starts with an original database of a base motif, and the goal is to develop a 
more general database of a target motif, which is related to the base motif in structure. (The 
target motif includes the base motif as a special case.) In other words, we would like to convert 
a good predictor for the base motif into a good predictor for the target motif. Our algorithm
has the following key features:

• The algorithm iteratively scans a large database of test sequences to find sequences that
are presumed to fold into the target motif. The selected sequences are then used to update
the parameters of the algorithm; these updates affect the performance of the algorithm
in the next iteration.

• In each iteration, the algorithm scores all the sequences based on its current estimates of
the parameters and the theoretical framework developed in [12].

• In each iteration, the algorithm uses randomness to select which sequences are presumed
to fold into the target motif.

• The selected sequences are used in the beginning of the next iteration to update the
parameters of the algorithm in a Bayesian-like weighting scheme.


There are several ways in which our iterative algorithm is kept running in a “safe” fashion, 
without increasing the false positive rate by incorporating sequences into the final database that 
do not fold into the motif. First, we begin with a mathematically sound scoring subroutine, 
that experimentally has a low false positive rate. Second, our method of computing likelihoods 
ensures that only a certain fraction of all residues are scored as positive examples of the motif 
(see section 4.3). Finally, while evaluating our program, we run the program with sequences 
that are known not to contain coiled coils, and this has helped us determine when the algorithm
is performing well.

This methodology does not appear to have been explored much in the biological literature.
Although a few papers have dealt with iterative algorithms [73, 3, 46, 36], they do not use
randomness and weighting for updating of parameters. In our experience, we find that these
components of the algorithm are critical to achieving good performance.


Implementation results 


In order to demonstrate the efficacy of our methods, we test them on the domain of 2- and 
3-stranded coiled coils (see section 4.4). 

First, we show how to use our methods to recognize 3-stranded coiled coils given examples of 
2-stranded coiled coils. In other words, starting with a base motif of 2-stranded coiled coils, we 
learn the target motif comprising both 2- and 3-stranded coiled coils. The initial predictor already
has good performance on 2-stranded coiled coils, so we test our algorithm by its performance 
on 3-stranded coiled coils. 

We evaluate our algorithm on 3-stranded coiled coils with respect to two statistical cross 
validation tests: the “leave one out” test and the “leave half out” test. In the first scenario, 
the algorithm starts with data from the 2-stranded coiled coil database, and iterates on a test 
set that contains sequences which are known to form 3-stranded coiled coils, sequences which 
are thought to form 3-stranded coiled coils, sequences for which no structural information is 
available, and sequences which are known not to contain coiled coils. The category of each 
sequence in this test set is not known to the algorithm, and the sequences which do not contain 
coiled coils are given to the algorithm in order to test its robustness. At the end of the procedure, 
the algorithm is evaluated by the number of the 3-stranded coiled coil sequences which it 
recognizes. Each time a sequence that is present in the database the algorithm is building is 
scored, it is removed from that database to avoid the possibility of unfairly biasing the test. In 
this scenario, we find that our algorithm greatly enhances the recognition of 3-stranded coiled 
coils, without affecting its performance on sequences that are known not to contain coiled coils. 
In particular, we are able to select 93% of the sequences that are conjectured by biologists to 
contain coiled coils, with no false positives out of the 286 sequences known not to contain coiled 
coils. The previous best performance without false positives was 67%.


We also test our algorithm on 3-stranded coiled coils in a much more difficult scenario. 
In particular, instead of cross validating our procedure by leaving out just one sequence at
a time when testing, the algorithm iterates on test sequences that contain only half of the 
sequences known to form 3-stranded coiled coils. It is then evaluated by its performance on 
the 3-stranded coiled coil sequences that are not iterated upon. In this scenario, we also find 
improved performance. The 3-stranded coiled coil sequences are split in half 3 times, and on 
average, the algorithm is able to select 85% of the left out 3-stranded coiled coil sequences, 
with likelihood scores higher than that of the highest scoring negative sequence. On average, 
the previous best performance without false positives is 67%. 

Finally, we test our program on subfamilies of 2-stranded coiled coils using the leave one 
out criterion. For 2-stranded coiled coils, we have a good data set consisting of a diverse set 
of sequences. However, to test our program, we simulate a limited data problem by testing 
our program LEARNCOIL on subfamilies of 2-stranded coiled coils. That is, one subfamily of 
2-stranded coiled coils is chosen to make up the base motif, and the class of all 2-stranded coiled 
coils is the target motif. Here we find that we have excellent performance; i.e., we are able to 
completely learn the coiled coil regions in our entire 2-stranded coiled coil database starting 
from a database consisting of coiled coils from any one subfamily. Based on our experiments, 
such performance does not appear to be possible without the use of our iterative algorithm. In
particular, the best performance for the non-iterative approach ranges between 70 and 88%.


Biological significance 


As a consequence of this work, we have identified many new sequences that we believe contain
coiled coils or coiled-coil-like structures, such as the envelope proteins of mouse hepatitis
virus and human rotavirus. One of our more striking findings is the existence of one and
occasionally two coiled-coil-like regions in the envelope proteins of many retroviruses, including
Human Immunodeficiency Virus (HIV), Simian Immunodeficiency Virus (SIV), and Human
T-cell Lymphotropic Virus (HTLV). Independent experimental investigations have also predicted
these coiled-coil-like regions in HIV and SIV [19, 56].


4.2 Further background


The coiled coil motif is found in fibrous proteins, DNA binding proteins, and in tRNA-synthetase 
proteins. Recently it has been proposed that the 3-stranded coiled coil motif acts as the cell 
fusion mechanism for many viruses, and algorithms for predicting these structures could aid in 
the study of how viruses invade cells. Computational methods [14, 58] have already identified 
such coiled coil regions in influenza virus hemagglutinin and Moloney murine leukemia virus
envelope protein; both of these predictions have been corroborated in the laboratory [30, 40]. 

Coiled coils are a particular type of α-helix, consisting of two or more α-helices wrapped
around each other with a slight left-handed superhelical twist. Coiled coils have a cyclic repeat
of seven positions, a, b, c, d, e, f, and g (see Figure 4.1). The seven positions are spread out
along two turns of the helix. Coiled coils show a characteristic heptad repeat with hydrophobic 
residues found in positions a and d, and this repeat makes coiled coils particularly amenable to 
recognition by computational techniques. 

Computational methods have been quite successful for predicting coiled coils [63, 58, 42, 12,
13, 14]. These techniques can be described, broadly, as follows:

1. Collect a database of known coiled coils and available amino acid subsequences.

2. Determine whether the unknown sequence shares enough distinguishing features with the
known coiled coils to be considered a coiled coil.


Standard approaches [63, 58] look at the frequencies of each amino acid residue in each of 
the seven repeated positions. Overall, this singles method performs reasonably well. When the NEWCOIL
program of Lupas et al. [58] is tested on the PDB (the database of all solved protein structures), 
it finds all sequences which contain coiled coils. On the other hand, 2/3 of the sequences it 
predicts to contain coiled coils do not. That is, the false positive rate for the standard method 
is quite high. 

Figure 4.1: (a) Top view of a single strand of a coiled coil. Each of the seven positions {a, b, c, d, e, f, g}
corresponds to the location of an amino acid residue which makes up the coiled coil. The arrows between
the seven positions indicate the relative locations of adjacent residues in an amino acid subsequence.
The solid arrows are between positions in the top turn of the helix, and the dashed arrows are between
positions in the next turn of the helix. (b) Side view of a 2-stranded coiled coil. The two coils are next
to each other in space, with the a position of one next to the d position of another. The coils also slightly
wrap around each other (not shown here).

These approaches based on the singles method build a table from the coiled coil database
that represents the relative frequency of each amino acid in each position; that is, there is a table
entry for each amino acid/coiled coil position pairing. For example, for Leucine and position
a, the entry in the table is the percentage of position a’s in the coiled coil database which are
Leucine, divided by the percentage of residues in Genbank (a large protein sequence database)
which are Leucine. For example, if the percentage of position a’s in the coiled coil database
which are Leucine is 27%, and the percentage of residues in Genbank which are Leucine is 9%,
then the table entry value for the pair Leucine and position a is 3. Intuitively, this table entry
represents the “propensity” that Leucine is in position a in a coiled coil.
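
As an illustration of how a table entry of this kind could be computed from raw counts, the following Python sketch may help. It is not the code of [58]; the function name `propensity_table`, the dictionary layout, and the toy counts are assumptions made for this example, chosen so that the Leucine entry reproduces the 27%/9% example above.

```python
from collections import Counter


def propensity_table(position_residues, background_counts):
    """Relative frequency of each (amino acid, position) pair.

    position_residues: dict mapping a heptad position ('a'..'g') to the list
        of residues observed at that position in the coiled coil database.
    background_counts: dict mapping each residue to its count in a large
        background database (Genbank in the text).
    """
    background_total = sum(background_counts.values())
    table = {}
    for position, residues in position_residues.items():
        for residue, count in Counter(residues).items():
            motif_freq = count / len(residues)
            background_freq = background_counts[residue] / background_total
            table[(residue, position)] = motif_freq / background_freq
    return table


# Hypothetical toy data: 27% of position-a residues are Leucine ('L'),
# against 9% of all residues in the background database.
position_a = ["L"] * 27 + ["V"] * 73
background = {"L": 9, "V": 91}
table = propensity_table({"a": position_a}, background)
```

With these toy counts, the entry for the pair (Leucine, a) is 0.27 / 0.09, that is, 3 up to floating-point rounding.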

The singles method approach [58] actually looks at 28-long windows, since stable coiled
coils are believed to be at least 28 residues long. Thus for each residue, it looks at each possible 
position (a through g), and at all 28-long windows that contain it. It then calculates the relative 
frequencies for each residue in the window. If the product of the relative frequencies for each 
residue in some window is greater than some threshold, it concludes that the residue is part of 
a coiled coil. 
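
The window scan can be sketched roughly as follows, assuming a propensity table keyed by (residue, position) pairs. This is a simplified reading of the description above rather than the NEWCOIL implementation: the function name, the use of log-products in place of raw products, and the floor value for unseen table entries are all assumptions of the sketch.

```python
import math

HEPTAD = "abcdefg"
WINDOW = 28


def singles_residue_scores(sequence, table, window=WINDOW, floor=1e-3):
    """For each residue, return the best log-product of table entries over
    every window that contains it and every heptad register (phase).

    `floor` is a hypothetical default used for (residue, position) pairs
    missing from the table; it is not part of the published method.
    """
    n = len(sequence)
    best = [-math.inf] * n
    for start in range(n - window + 1):
        for phase in range(7):
            log_product = 0.0
            for offset in range(window):
                position = HEPTAD[(phase + offset) % 7]
                entry = table.get((sequence[start + offset], position), floor)
                log_product += math.log(entry)
            # Every residue inside this window may use this window's score.
            for i in range(start, start + window):
                best[i] = max(best[i], log_product)
    return best
```

Under this sketch, a residue would be called part of a coiled coil when its score exceeds the logarithm of the chosen threshold; sequences shorter than one window receive a score of negative infinity everywhere.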

Recently researchers have put this problem within a probabilistic framework [12, 13, 14], 
and have given linear-time algorithms for predicting coiled coils by approximating dependencies 
between positions in the coiled coil using pairwise frequencies. This method for prediction uses 
estimates of probabilities for singles and pair positions. For example, in addition to estimating
the probability that a Leucine appears in position a of a coiled coil, it also estimates the
probability that a Leucine appears in position a of a coiled coil with a Valine appearing in the
following d position. For a given residue’s contribution to the score, the algorithm considers
residues at the structurally relevant distances $i = 1$, $i = 2$, and $i = 4$, calculating the geometric
mean of the three quantities $P(k, k+i)/P(k+i)$, where $P(k, k+i)$ is the probability of finding
residues $k$ and $k+i$ distance $i$ apart in a coiled coil, and $P(k+i)$ is the probability of finding
residue $k+i$ in a coiled coil.
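
The geometric mean of the three ratios can be written in a few lines of Python. The sketch below is only meant to make the arithmetic concrete: the dictionary layouts for the pair and singles probabilities are hypothetical, and the dependence of the probabilities on the heptad position is deliberately left out to keep the example short.

```python
import math

DISTANCES = (1, 2, 4)  # the structurally relevant distances named in the text


def pair_contribution(pair_prob, single_prob, sequence, k):
    """Geometric mean of P(k, k+i) / P(k+i) over the distances in DISTANCES.

    pair_prob:   dict keyed by (residue at k, residue at k+i, i)   [assumed layout]
    single_prob: dict keyed by residue                              [assumed layout]
    """
    log_sum = 0.0
    for i in DISTANCES:
        here, there = sequence[k], sequence[k + i]
        log_sum += math.log(pair_prob[(here, there, i)] / single_prob[there])
    # exp of the mean log is the geometric mean of the three ratios.
    return math.exp(log_sum / len(DISTANCES))
```

For instance, if the three ratios are 0.5, 0.25, and 0.125, the contribution is their geometric mean, 0.25.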

This method of predicting coiled coils has been very effective. When tested on the PDB, the 
PAIRCOIL algorithm based on this method selects out all sequences that contain coiled coils, 
and rejects all the sequences that do not contain coiled coils. Furthermore, when tested on a 
database of 2-stranded coiled coils (with a sequence removed from the database at the time it 
is scored), each amino acid residue in a coiled coil region is correctly labeled as being part of a 
coiled coil. 

Since the PAIRCOIL algorithm has better performance than the singles method algorithm, 
particularly with respect to the false-positive rate, this is the scoring method we build on, as 
well as the scoring method to which we compare our results. 

Other types of iterative approaches have been applied to sequence alignment and protein 
structure prediction by researchers [73, 3, 46, 36]. Algorithmically, our approach differs from 
these approaches in two major ways. The first is our use of randomness to incorporate sequences 
into our database, and the second is our use of weighting to update the database (see section 4.3). 
In addition, several of these papers are directed toward sequence alignment, and sequence 
alignment is not so effective a tool for predicting coiled coils, as the various subfamilies of coiled 
coils do not align well to each other. Also, since the goal of these other methods is often to 
output potential matching alignments, the testing of these algorithms is quite different. In 
particular, although some of these approaches use the “leave one out” criterion, to the best of 
our knowledge, none of them test performance with the “leave half out” criterion. 

Various machine learning techniques have been applied to the protein structure prediction 
problem. The two main approaches are neural nets (e.g., [47, 67, 59]) and hidden Markov 
models (e.g., [53, 9]). Both of these approaches require adequate data on the target motif, 
since there is a “training session” on sequences that are known to contain the target motif.
Our approach differs from these methods since it does not require well analyzed data on the
target motif per se. Instead it uses already available data on a base motif and generalizes it
to recognize the target motif, by running on a large number of sequences, some of which are
suspected to fold into the target motif.

Figure 4.2: Our basic learning algorithm. Initially, the algorithm starts off with a test set of examples
and a set of initial parameters. In each iteration, the algorithm selects new examples, and re-estimates
its parameters.

Other learning approaches which have been applied to protein structure prediction include
rule-based methods (e.g., [60]).


4.3 The algorithm


We first describe the general framework for our algorithm. Namely, we are initially given a set 
of parameters that help characterize our base concept, and a set of test examples. Our goal is to 
decide which of these test examples are positive examples of some target concept. In addition, 
we know that the target concept is a generalization of the base concept. Our algorithm takes 
advantage of the fact that the base concept is somewhat related to the target concept. In 
particular, once the algorithm has identified some of the test examples that are presumed to be 
related to the base concept, it can modify its database by “adding” these newly found examples.
Examples are selected by a randomized procedure based on likelihoods. This process is then
iterated, as the added examples change the scores of other examples. (See Figure 4.2.)

We have implemented our learning algorithm for the protein motif recognition problem. In 
particular, our learning algorithm LEARNCOIL proceeds as follows. It is given two inputs: a 
database of a base motif which is related to the target motif we are interested in, and a large 
database of iteration test sequences which is comprised of sequences that we believe contain the 
target motif as well as many other sequences of unknown structure. In practice, we generally 
include in the iteration test sequences some fraction of the PIR (a large protein sequence 
database), the sequences from the PDB (the database of solved protein structures) that are 
known not to fold into the target motif, and sequences conjectured by biologists to fold into 
the target motif. 

Initially, the algorithm estimates pair and singles amino acid residue probabilities for the
motif’s positions. Then the algorithm iterates four basic steps:

1. The algorithm uses its estimates of the pair and singles probabilities to determine a
likelihood function, which maps residue scores to a likelihood of the residue belonging to
the target motif.

2. The algorithm scores each of the iteration test sequences using the estimated probabilities,
and calculates the likelihoods for each of these sequences.

3. The algorithm flips coins with probability proportional to the likelihood of each score to
determine which parts (if any) of each sequence are presumed to be part of the target
motif. The residues which are thus determined to be presumed examples of the target
motif make up the new database for the next iteration.

4. The algorithm uses the base motif database and the new database just determined in this
iteration to update its estimates of the singles and pair probabilities for the target motif
using a Bayesian-like weighting scheme (see section 4.3.4).

The algorithm continues iterating until the new database stabilizes.

We now describe each of the components of the algorithm in more detail, using coiled coils
as an example, although the algorithm can be applied to other protein motifs.
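
The overall control flow of the four steps can be sketched as a short Python skeleton. This is not the LEARNCOIL source: every callable passed in (`estimate`, `make_likelihood`, `score`, `select`, `converged`) is a placeholder invented for this sketch, and details such as removing a previously selected sequence from the database before scoring it are omitted. The cap of 15 iterations follows the termination rule of section 4.3.5.

```python
import random


def learn_motif(base_db, test_sequences, estimate, make_likelihood, score,
                select, converged, max_iterations=15, rng=None):
    """Iterate steps 1-4 until the selected database stabilizes.

    Placeholder callables (assumptions of this sketch):
      estimate(base_db, new_db)          -> parameters (step 4; also the
                                            initial estimate when new_db is empty)
      make_likelihood(params)            -> function from score to likelihood
      score(params, sequence)            -> score of one sequence
      select(sequence, likelihood, rng)  -> list of selected items (may be empty)
      converged(previous_db, new_db)     -> True when the database has stabilized
    """
    rng = rng or random.Random()
    new_db = []
    params = estimate(base_db, new_db)  # initial estimates from the base motif
    for _ in range(max_iterations):
        likelihood_of = make_likelihood(params)                  # step 1
        previous_db, new_db = new_db, []
        for sequence in test_sequences:
            likelihood = likelihood_of(score(params, sequence))  # step 2
            new_db.extend(select(sequence, likelihood, rng))     # step 3
        params = estimate(base_db, new_db)                       # step 4
        if converged(previous_db, new_db):
            break
    return params, new_db
```

Passing the random number generator explicitly keeps the randomized selection step reproducible when a seed is fixed.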

4.3.1 Scoring


In our implementation, we use the PAIRCOIL program described by Berger et al. [14] as our 
scoring procedure, although any good prediction algorithm with a low false positive rate can 
be used for scoring. This scoring method uses correlation methods that incorporate pairwise 
dependencies between amino acids at multiple distances. The scoring procedure gives a residue 
score for each amino acid in a given sequence, as well as a sequence score, which is the maximum 
residue score in the sequence. 

In order to use this scoring procedure, we must have estimates for the probabilities for the 
singles and pair positions for the motif. Initially, we have estimates for the probabilities based 
on the database of sequences of the base motif, and after each iteration of the algorithm, we 
use updated probabilities. In each iteration after the first, when we score a sequence we check 
to see if it was identified in the previous iteration. If it was, we remove this sequence from the 
database and adjust the probabilities before scoring. 

Given good estimates for the probabilities for the singles and pair positions for the motif,
and reasonable assumptions about dependencies in the motif, the PAIRCOIL scoring method
which we use as a subroutine is mathematically justified [12].


4.3.2 Computing likelihoods 


Once we have a sequence score, we assess it by converting it into a likelihood that the sequence 
contains the target motif. In each iteration of the algorithm, we compute a function that takes 
a residue score and computes the likelihood that the residue is part of the target motif. 

We compute this likelihood function in a manner described in [14]. In particular, every 
sequence in a large sequence database is scored. (Ideally, this large sequence database is the 
PIR. However, in practice, to save time, we use a sampled version of the PIR, which is 1/25-th 
the size; the likelihood function calculated using this sampled PIR is a good approximation 
to the likelihood function calculated using the entire PIR.) The sampled PIR residue score 
histograms are nearly Gaussian distributed with some extra probability mass added on the 
right-hand tail. This extra mass is attributed to residues in the target motif, since they are 
expected to score higher. In the case of the coiled coil motif, given the biological data currently
available, it is estimated that between 1/50 and 1/30 of residues in the PIR are in a coiled coil.

To fit a Gaussian to the histogram data, we calculate the mean so that the extra probability
mass on the right side of the mean corresponds to between 1/50 and 1/30 of the total mass of
the PIR. We then compute the standard deviation using only scores below that mean, where
a Gaussian better fits the histogram data. The likelihood that a residue with a given score is
in a coiled coil is estimated as the ratio of the extra histogram mass above the Gaussian at that
score (corresponding to data assumed to be coiled) to the total histogram mass at that score.
A least-squares line is then fit to approximate the likelihood function in the linear region
from 10 to 90 percent. This line then gives an approximation for the likelihoods corresponding
to all scores.
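
One possible reading of this fitting procedure is sketched below in Python. The interpretation is an assumption on our part: since a Gaussian has equal mass on both sides of its mean, placing a fraction f of extra mass on the right means that a fraction (1 - f)/2 of all scores lies below the mean. The function names, the histogram bin width, and the clipping of negative likelihoods to zero are also assumptions, and the final least-squares line fit is left out.

```python
import math


def fit_background_gaussian(scores, motif_fraction):
    """Fit a Gaussian to the bulk of the scores, assuming the excess mass on
    the right of the mean is `motif_fraction` of all residues."""
    ordered = sorted(scores)
    # Mass below the mean must be (1 - motif_fraction) / 2 of the total.
    idx = int(len(ordered) * (1.0 - motif_fraction) / 2.0)
    mean = ordered[idx]
    below = ordered[:idx]
    # Standard deviation from scores below the mean only.
    sigma = math.sqrt(sum((s - mean) ** 2 for s in below) / len(below))
    background_count = 2 * len(below)  # total mass attributed to the Gaussian
    return mean, sigma, background_count


def likelihood_by_bin(scores, mean, sigma, background_count, bin_width):
    """Per-bin likelihood: histogram mass above the Gaussian, divided by the
    total histogram mass in that bin. Keys are bin centers."""
    counts = {}
    for s in scores:
        b = math.floor(s / bin_width)
        counts[b] = counts.get(b, 0) + 1
    norm = sigma * math.sqrt(2.0 * math.pi)
    result = {}
    for b, observed in counts.items():
        center = (b + 0.5) * bin_width
        density = math.exp(-0.5 * ((center - mean) / sigma) ** 2) / norm
        expected = background_count * density * bin_width
        result[center] = max(0.0, (observed - expected) / observed)
    return result
```

A line fit through the bins whose likelihood lies between 10 and 90 percent would then extend this table to all scores, as described above.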

One feature of this method of computing likelihoods is that it does not allow too many
residues to be considered as part of coiled coils. This helps keep the false positive rate of the
algorithm low.


4.3.3 Randomized selection of the new database


Once we have obtained the likelihood function for an iteration, we wish to use the likelihoods 
to build a new database of sequences presumed to fold into the target motif. At the beginning 
of each iteration, our new database contains no sequences. Then for each sequence in the set of
test sequences, we do the following. First, we score the sequence and then convert its sequence
score to a likelihood. Next, we draw a number uniformly at random from the interval [0, 1]. If
the number drawn is less than or equal to the likelihood of the sequence, then the sequence is
added to the new database. Specifically, all residues in this sequence that have scores equal to
the sequence score or greater than the 50% likelihood score (which is the algorithm’s cutoff for
a residue being in a coiled coil) are added to the database. Once we have processed every
sequence in our test set, we have our new database of sequences presumed to fold into the
target motif.

In practice, we find that adding randomness substantially improves the performance of
our algorithm. In fact, if the procedure is written just to accept sequences that have greater
than 50% likelihood, then the algorithm fails to recognize many sequences which are known to
contain 3-stranded coiled coils. On the other hand, if the procedure lowers the threshold value
for acceptance, then its false positive rate increases.
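
The selection rule for a single sequence can be sketched in Python as follows. The function and parameter names are invented for this example, and the likelihood function is assumed to be supplied by the caller (for instance, the fitted line of section 4.3.2).

```python
import random


def select_residues(residue_scores, likelihood_of, cutoff_score, rng=random):
    """Randomized selection for one sequence.

    residue_scores: per-residue scores for the sequence (non-empty list).
    likelihood_of:  function mapping a score to a likelihood in [0, 1].
    cutoff_score:   the score whose likelihood is 50%.
    Returns the indices of the residues added to the new database, or an
    empty list if the coin flip rejects the sequence.
    """
    sequence_score = max(residue_scores)
    # Accept when the uniform draw is <= the likelihood of the sequence score.
    if rng.random() > likelihood_of(sequence_score):
        return []
    return [i for i, s in enumerate(residue_scores)
            if s == sequence_score or s > cutoff_score]
```

Replacing the random draw with the constant 0.5 gives the deterministic 50% rule that the paragraph above reports as missing many 3-stranded coiled coils.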

4.3.4 Updating parameters


Once we have a new database of sequences which are thought to contain the target motif, we 
need to update the parameters used by the algorithm for scoring. In our case, in each iteration 
of the algorithm, the scoring procedure needs updates of the estimates of probabilities for 
singles and pair positions. The most straightforward way to update the probabilities is to use 
a maximum likelihood estimate from frequency counts from the new database. However, this
does not work well in practice. Instead, we update each probability by taking a weighted
average of the probability given by the base motif database and the probability given by the 
new database. 

We now describe a theoretical framework for updating probabilities in this manner in each 
iteration of our algorithm. The approach we give is motivated by a Bayesian viewpoint [45, 15]. 
In particular, we think of the probabilities we are trying to estimate as the parameters of a 
Multinomial distribution, and we use the Dirichlet density to model the prior information we 
have about these probabilities. In fact, the approach we give is not completely Bayesian, as we 
will use the seen data to pick the parameters of the prior distribution; this is sometimes called 
a Bayes/Non-Bayes compromise [45]. 

We will use frequency counts from our databases to estimate singles and pair probabilities. 
For simplicity, we focus on the case of updating singles probabilities; updating pair probabilities 
is analogous. 

Initially, we have a database of sequences which fold into a particular base motif. Thus, for 
each position in the motif, we have a 20-long count vector, one for each of the 20 amino acids. 
For example, for a given database of known coiled coils, for position a, we know how many 
times each amino acid appears. In addition, after each iteration of the algorithm, we have a 
new database of sequences that we have selected and which we presume fold into the target 
motif. This new database also gives us a 20-long count vector for each position in the motif. 

We update the probabilities using these frequency count vectors. In particular, we fix a
numbering of the amino acids from 1 to 20. Then for each position $q$ in the motif (for coiled
coils, $q \in \{a, b, c, d, e, f, g\}$), we have a count vector $\vec{x}^{(q)} = (x_1^{(q)}, x_2^{(q)}, \ldots, x_{20}^{(q)})$, where $x_i^{(q)}$ is the
number of times amino acid $i$ appears in position $q$ of the motif in the base motif database. In
addition, we have a count vector $\vec{y}^{(q)} = (y_1^{(q)}, y_2^{(q)}, \ldots, y_{20}^{(q)})$, where $y_i^{(q)}$ is the number of times
amino acid $i$ appears in position $q$ of the motif in the new database (i.e., the database consisting
of the sequences we have picked in this iteration of the algorithm).

Let $\vec{p}^{(q)} = (p_1^{(q)}, p_2^{(q)}, \ldots, p_{20}^{(q)})$ be the actual probabilities for the amino acids appearing in
position $q$ of the motif. We assume, for simplicity, that the count vectors for each position
are independent of each other. Thus, we focus on updating the probabilities of one position
independent of the other positions. For notational convenience, we fix a position and drop the
superscript $q$. We assume that for a fixed position, the count vector is generated at random
according to the Multinomial distribution with parameter $\vec{p} = (p_1, p_2, \ldots, p_{20})$. The parameters
$p_1, p_2, \ldots, p_{20}$ are the “true” probabilities of seeing the amino acids in the fixed position in the
motif we are interested in. These are the parameters we wish to estimate.

In our case, we have very strong a priori knowledge about the probabilities. Since we are 
trying to learn a particular target structural motif from a related base structural motif, we can 
use the probabilities estimated from the base motif as prior probabilities. In fact, because these 
structural motifs are related, we expect the updated probabilities for the target motif to be 
similar to the original probabilities for the base motif. 

We model our a priori beliefs by the Dirichlet density. The value of a Dirichlet density
$D(\vec{\alpha})$ (with parameter $\vec{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_k)$, where $\alpha_i > 0$ and $\alpha_0 = \sum_{i} \alpha_i$) at a particular point
$\vec{z} = (z_1, z_2, \ldots, z_k)$, where $\sum_{i} z_i = 1$, is given by:

$$\frac{\Gamma(\alpha_0)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} z_i^{\alpha_i - 1}.$$

The gamma function $\Gamma(\alpha)$ is:

$$\Gamma(\alpha) = \int_0^{\infty} e^{-t}\, t^{\alpha - 1}\, dt.$$

The mean of the Dirichlet density is $(\alpha_1/\alpha_0, \alpha_2/\alpha_0, \ldots, \alpha_k/\alpha_0)$, and the larger $\alpha_0$ is, the smaller
the variance is.

Thus a Bayesian estimate for the probabilities $p_1, p_2, \ldots, p_{20}$ can be found by looking at
the posterior distribution. The Dirichlet distribution is conjugate for the Multinomial, and the
posterior distribution is the Dirichlet distribution $D(\vec{\alpha} + \vec{y})$ [15, 45]. That is, the new parameter
of the distribution is the vector sum of the original parameters and the observed data. Thus, a
Bayesian estimate for probability $p_i$ after seeing the data $\vec{y}$ is

$$\frac{\alpha_i + y_i}{\alpha_0 + y_0}, \quad \text{where } y_0 = \sum_{i=1}^{20} y_i.$$


We still have not addressed the issue of how the parameters of the prior distribution are 
chosen. We depart from the traditional Bayesian approach, and choose the parameters of the 
prior distribution after seeing the data. In particular, since the base motif and the target motif 
are related, we want the base motif database to have a strong effect on the estimates for our 
probabilities, and thus we choose the variance of the prior distribution accordingly. 

The mean of the Dirichlet density is specified by the estimated probabilities of the base 
motif. The variance of the density is picked as follows. If $0 < \lambda < 1$ is the effect, or weight,
that we want the base motif database to have, then we let $\alpha_i = x_i \cdot \frac{\lambda y_0}{(1-\lambda) x_0}$, where $x_0 = \sum_{i=1}^{20} x_i$
and $y_0 = \sum_{i=1}^{20} y_i$. (Actually, we have to be careful in the case where $x_i = 0$.) It is easy to
verify that our estimate for the probability $p_i$ is given by $\lambda \frac{x_i}{x_0} + (1-\lambda) \frac{y_i}{y_0}$. Namely, our updated
probability is a weighted average of the probability given by the base motif database and the
probability given by the new database.
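
The equivalence between the weighted average and the Dirichlet posterior mean can be checked numerically. The Python sketch below assumes prior parameters of the form alpha_i = x_i * w * y_0 / ((1 - w) * x_0), where w is the weight given to the base motif database, x and y are the base and new count vectors, and x_0 and y_0 are their totals; this is the choice of prior under which the posterior mean (alpha_i + y_i) / (alpha_0 + y_0) reduces to the weighted average. The function names are ours, and the zero-count and empty-database cases are not handled.

```python
def weighted_update(base_counts, new_counts, weight):
    """Weighted average of frequencies: w * x_i / x_0 + (1 - w) * y_i / y_0."""
    x0, y0 = sum(base_counts), sum(new_counts)
    return [weight * x / x0 + (1 - weight) * y / y0
            for x, y in zip(base_counts, new_counts)]


def dirichlet_posterior_mean(base_counts, new_counts, weight):
    """The same estimate computed as (alpha_i + y_i) / (alpha_0 + y_0)."""
    x0, y0 = sum(base_counts), sum(new_counts)
    scale = weight * y0 / ((1 - weight) * x0)   # assumed form of the prior
    alphas = [x * scale for x in base_counts]
    alpha0 = sum(alphas)
    return [(a + y) / (alpha0 + y0) for a, y in zip(alphas, new_counts)]
```

For base counts (6, 3, 1), new counts (1, 1, 2), and weight 0.75, both functions give (0.5125, 0.2875, 0.2), which sums to 1 as a probability vector should.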

In practice, we have found that our method of updating probabilities has worked well. It
is superior to a maximum likelihood approach which uses just the current iteration’s frequency
counts; the maximum likelihood estimates are especially problematic in the zero frequency
case. Our method also performs better than an unweighted approach using both the initial
frequency counts and the current iteration’s frequency counts. The unweighted estimates
depend largely on the size of the original database and on the number of residues
that are presumed at each iteration to be part of the target motif. In our test domain of coiled
coils, we found that the unweighted method missed more sequences that contain
coiled coils than did our method for updating probabilities.

Using Dirichlet mixture densities as priors to estimate amino acid probabilities has been 
studied by Brown et al. [29]. Their approach uses as a prior the maximum likelihood estimate 
of a mixture Dirichlet density, based on data previously obtained from multiple alignments
of various sets of sequences. Their approach is a pure Bayesian approach, and their prior
distribution has a smaller effect on the final probability estimates.


4.3.5 Algorithm termination


The iteration process terminates when it stabilizes; that is, when the number of residues added 
from the previous iteration changes by less than 5%. Usually the procedure converges in around 
six iterations; otherwise, we terminate it after 15 iterations. In practice, we found that the 
algorithm rarely had to be terminated due to lack of convergence.
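
The stopping rule can be sketched in Python as follows. We read "changes by less than 5%" as a relative change in the number of selected residues between consecutive iterations; that reading, the function names, and the treatment of a zero count are assumptions of this sketch.

```python
def has_stabilized(previous_count, current_count, tolerance=0.05):
    """True when the number of selected residues changed by less than
    `tolerance`, relative to the previous iteration."""
    if previous_count == 0:
        return current_count == 0
    return abs(current_count - previous_count) / previous_count < tolerance


def run_until_stable(step, max_iterations=15):
    """Call step() once per iteration (it returns the number of residues
    selected) until the count stabilizes or max_iterations is reached.
    Returns (iterations used, last count)."""
    previous = step()
    for iteration in range(2, max_iterations + 1):
        current = step()
        if has_stabilized(previous, current):
            return iteration, current
        previous = current
    return max_iterations, previous
```

With counts of 100, 150, 170, and 172 residues in successive iterations, the loop stops at the fourth iteration, since 172 differs from 170 by about 1.2%.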

In our implementation, the running time of the entire algorithm is linear in the total number 
of residues in all sequences which are given as input. The basic operation in each iteration is 
scoring every sequence using the PAIRCOIL algorithm. For each sequence, the PAIRCOIL scoring 
program takes time linear in the number of residues. Since we have at most a fixed number of 
iterations, the entire algorithm is linear-time. 

After running LEARNCOIL, the “learned” target concept contains both 2- and 3-stranded
coiled coils. The problem of distinguishing one set from the other remains. The MultiCoil
program of Wolf, Kim, and Berger [unpublished results] is being developed for this purpose and
in initial experiments performs well.


4.4 Results 


We have implemented our algorithm in a C program called LEARNCOIL. We test our program on 
the domain of 3-stranded coiled coils and subclasses of 2-stranded coiled coils. First we describe 
the databases we use to test the program, and then we follow by describing the program’s
performance.


4.4.1 The databases and test sequences 


Our original database of 2-stranded coiled coils consists of 58,217 amino acid residues which 
were gathered from sequences of myosin, tropomyosin, and intermediate filament proteins [14]. 
We also have separate databases containing sequences from each of these protein subclasses 
individually. A synthetic peptide of tropomyosin is the only solved structure among these. 
We test LEARNCOIL on the 3-stranded coiled coils by starting the algorithm with the base 
database of all 2-stranded coiled coils. We test LEARNCOIL on the 2-stranded coiled coils by
starting the algorithm with a base database of one of the subfamilies of the 2-stranded coiled
coils.

The set of iteration test sequences for testing performance on 3-stranded coiled coils consists 
of the following 5516 sequences: 286 known non-coiled coils from the non-redundant version 
of the PDB created in [14] (the PDB is the database of solved protein structures); 2% of the 
sequences in OWL (a large non-redundant composite database, where no two sequences in 
the database are exactly the same and no two sequences show only “trivial” differences [20]), 
with any obvious members of the PDB removed (2815 total); sequences in OWL whose names 
contain the strings actinin, alpha spectrin, dystrophin, tail fiber, laminin, fibrinogen, env, 
spike, glycoprotein, bacteriophage T4 wac, bacteriophage K3 fibritin, heat shock transcription, 
or macrophage scavenger receptor, as well as the 3-stranded coiled coil mutant for GCN4 (2415 
total, of which many are thought to contain 3-stranded coiled coils, and the 46 sequences given 
below are known to contain them). 

The 3-stranded coiled coil set consists primarily of laminin and fibrinogen sequences,
as well as influenza virus hemagglutinin, Moloney murine leukemia envelope protein, 2 heat
shock transcription factors, bacteriophage T4 and K3 wac proteins, the trimeric GCN4 mutant,
2 macrophage scavenger receptors, and bacteriophage T3 and T7 tail fibers.

Our set of iteration test sequences for 2-stranded coiled coils includes: 1/23 of the PIR
(1553 total); the 286 known non-coiled coils; and the two subfamilies, out of myosins,
tropomyosins, and intermediate filaments, that are not in the base database. (For example,
when we start with a database of intermediate filaments, our iteration test sequences include
myosins and tropomyosins.)

Note that most of the sequences in our 2- and 3-stranded coiled coil data sets do not have 
solved structures. However, there is strong experimental support that they contain coiled coils, 
although often the boundaries of the coiled coil regions are difficult to specify exactly. We do 
not know the three-dimensional structure for most of the protein sequences in our iteration test
sets (except for the sequences from the PDB and portions of the sequences making up the 2-
and 3-stranded coiled coil data sets).


4.4.2 Learning 3-stranded coiled coils 


Our techniques improve on non-learning based approaches, such as PAIRCOIL [14], which often
fails to identify 3-stranded coiled coil regions.


Base Set  | Evaluation Set | Performance without LEARNCOIL      | Performance with LEARNCOIL
          |                | % of seqs | # of false positive seqs | % of seqs | # of false positive seqs
2-str CCs | 46 3-str CCs   | 67%       | 0/286                    | 93%       | 0/286


Table 4.1: Learning 3-stranded coiled coils from 2-stranded coiled coils using the leave one 
out criterion. 


We test the algorithm on 3-stranded coiled coils in two ways: the “leave one out” test and 
the “leave half out” test. In both cases, LEARNCOIL improves recognition of 3-stranded coiled 
coils starting with an initial database of 2-stranded coiled coils. We measure LEARNCOIL’s 
performance on the 286 non-coiled coil proteins, and an evaluation set consisting of 3-stranded 
coiled coil sequences. We assume that a false negative prediction has occurred when a sequence 
in the 3-stranded coiled coil evaluation set receives a score with a corresponding likelihood less 
than 50%. We assume a false positive has occurred when a non-coiled coil protein scores at least 
50% likelihood. Since our algorithm is randomized, the final likelihoods are found by averaging 
LEARNCOIL outputs over five runs. 
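
The evaluation rule described above (average the likelihoods over several runs, then apply the 50% cutoff) can be sketched as follows. This is an illustrative sketch only: the function names and the dictionary-based data layout are assumptions, not part of LEARNCOIL.

```python
from statistics import mean

THRESHOLD = 0.5  # the 50% likelihood cutoff stated in the text


def average_likelihoods(runs):
    """Average per-sequence likelihoods over several randomized runs.

    runs: list of dicts mapping sequence name -> likelihood in [0, 1]
    (a hypothetical layout; every run must score the same sequences).
    """
    names = runs[0].keys()
    return {name: mean(run[name] for run in runs) for name in names}


def count_errors(avg, positives, negatives, threshold=THRESHOLD):
    """Count false negatives and false positives from averaged likelihoods.

    A known coiled coil scoring below the threshold is a false negative;
    a known non-coiled coil scoring at or above it is a false positive.
    """
    false_neg = sum(1 for s in positives if avg[s] < threshold)
    false_pos = sum(1 for s in negatives if avg[s] >= threshold)
    return false_neg, false_pos
```

In the setting of this section, `positives` would be the 3-stranded coiled coil evaluation set, `negatives` the 286 non-coiled coil proteins, and `runs` would hold five entries.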

In the first “leave one out” scenario, the algorithm is run with all the 5516 iteration test 
sequences described in section 4.4.1. Once the algorithm terminates, each of the 46 sequences 
in the 3-stranded coiled coil set is scored with respect to parameters calculated from the new 
database in the final iteration minus the effects of this sequence. That is, because the 46
3-stranded coiled coil sequences are included in the iteration test set, a sequence that appears
in the final database is removed from it before that sequence is scored, to avoid the possibility
of unfairly biasing the test.

The weight of the original database (i.e., relative to the new database) was chosen empirically
to be A = 0.1. This makes sense because 2- and 3-stranded coiled coils differ substantially;
the newly identified sequences may therefore need much more weight to broaden the new
database enough to contain 3-stranded coiled coils. We also experimented with weights in the
range 0 < A < 0.5, but A = 0.1 gave the best results.
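
To illustrate the role of the weight A, the sketch below blends two probability tables. It is a hedged illustration only: the actual update rule is defined earlier in the chapter and is not reproduced here, so the convex combination, the function name `blend_probabilities`, and the table layout are all assumptions.

```python
def blend_probabilities(p_original, p_new, weight):
    """Blend two probability tables defined over the same keys.

    Assumed form (not confirmed by the text): a convex combination in which
    `weight` is the share given to the original database and 1 - weight the
    share given to the newly identified sequences.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return {
        key: weight * p_original[key] + (1.0 - weight) * p_new[key]
        for key in p_original
    }
```

Under this assumption, A = 0.1 lets the newly identified sequences account for 90% of each estimate, which matches the intuition that the database must move far from its 2-stranded starting point.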

Our algorithm LEARNCOIL positively identifies 43 out of 46 (93%) of the 3-stranded coiled
coil sequences and makes no false positive predictions. In contrast, PAIRCOIL positively
identifies 31 out of 46 (67%) of the 3-stranded coiled coils and also makes no false positive
predictions (see Table 4.1). Moreover, using the final databases that LEARNCOIL produced, we
are able to recognize all the sequences in the 2-stranded coiled coil database. Thus the final
databases produced by the LEARNCOIL algorithm perform well on both 2- and 3-stranded coiled coils.

In the second “leave half out” scenario, we split the 3-stranded coiled coil sequence set 
in half in the following manner. First, the 46 3-stranded coiled coil sequences are divided 
into the following subgroups: α-fibrinogens, β-fibrinogens, γ-fibrinogens, laminins, tail fibers,
heat shocks, and all remaining protein sequences. Next, each of these subgroups is randomly 
divided into two parts, one for each half; this ensures that in the final split, each half is fairly 
representative of examples of the 3-stranded coiled coil motif. 

We split the 3-stranded coiled coil sequences 3 times in the above manner. This then gives 
us six different iteration and evaluation sets. Each evaluation set consists of 23 3-stranded 
coiled coil sequences, and the corresponding iteration test set consists of 5493 sequences (the 
original 5516 sequences, minus the 23 sequences in the evaluation set). We run LEARNCOIL 
on each of the six iteration test sets, and evaluate the algorithm by its performance on the 
corresponding evaluation sets (namely, those 3-stranded coiled coil sequences which are not 
included in the iteration test set). Note that the set of sequences with solved structures that
do not contain coiled coils is included in all iteration test sets, and is scored using the leave
one out criterion.
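
The leave half out construction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the dictionary of subgroups, and the tie-breaking rule for odd-sized subgroups (the extra sequence goes to whichever half is currently smaller) are illustrative choices, not details taken from the thesis.

```python
import random


def split_in_half(subgroups, rng):
    """Randomly divide each subgroup between two halves.

    subgroups: dict mapping a subgroup name to a list of sequence names.
    For an odd-sized subgroup, the larger piece goes to the half that is
    currently smaller, so the two halves never differ by more than one.
    """
    half_a, half_b = [], []
    for name in sorted(subgroups):
        members = list(subgroups[name])
        rng.shuffle(members)
        cut = len(members) // 2
        small, large = members[:cut], members[cut:]
        if len(half_a) <= len(half_b):
            half_a.extend(large)
            half_b.extend(small)
        else:
            half_a.extend(small)
            half_b.extend(large)
    return half_a, half_b


def leave_half_out_sets(subgroups, iteration_sequences, n_splits=3, seed=0):
    """Build (iteration set, evaluation set) pairs: two per random split."""
    rng = random.Random(seed)
    experiments = []
    for _ in range(n_splits):
        for held_out in split_in_half(subgroups, rng):
            held = set(held_out)
            iteration_set = [s for s in iteration_sequences if s not in held]
            experiments.append((iteration_set, held_out))
    return experiments
```

With three splits this yields six experiments, mirroring the sets A1, A2, B1, B2, C1 and C2; each evaluation half is excluded from its own iteration test set.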

For each iteration test set, our algorithm is again run 5 times with A = 0.1, and with final
likelihoods averaged over the runs. Table 4.2 gives the performance of our algorithm on the
different evaluation sets. On average, LEARNCOIL selects out 85% of the 3-stranded coiled coil
sequences not originally in the set of sequences upon which it iterates. In contrast, PAIRCOIL
on average selects out 67% on the same sets of sequences. In all but one of the six experiments,
the algorithm does not get any false positives from the set of solved structures. In the one
experiment where it does get a false positive, every sequence in the corresponding evaluation
set (B1) that scores above 50% likelihood also scores higher than this false positive.

The average performance of LEARNCOIL on the 3-stranded coiled coil sequences included
in the iteration test set is 88%. (Individual performance data for each of the six experiments is
not shown.) This average does not seem to be significantly higher than the algorithm’s average
performance on the sequences in the evaluation set. Thus, comparing the results in Table 4.2
with the results in Table 4.1, it appears that the decreased performance on the runs with
the splits results from fewer 3-stranded coiled coil sequences being available to the algorithm,
and not from the choice between the leave one out and the leave half out evaluation criterion.


Base Set | Evaluation Set       | Performance without LEARNCOIL      | Performance with LEARNCOIL
         |                      | % of seqs | # of false positive seqs | % of seqs | # of false positive seqs
         | Set A1, 23 3-str CCs |           | 0/286                    |           | 0/286
         | Set A2, 23 3-str CCs |           | 0/286                    |           | 0/286


Table 4.2: Learning 3-stranded coiled coils from 2-stranded coiled coils using the leave half out
criterion. The 3-stranded coiled coil sequences are split 3 times, giving us six different iteration
and evaluation sets. The evaluation sets are A1, A2, B1, B2, C1 and C2 (A1 and A2 are a
result of one split, etc.).


4.4.3 Learning subclasses of 2-stranded coiled coils 


Our results on subclasses of the 2-stranded coiled coil motif indicate that we are able to “learn” 
coiled coil regions in one family of proteins using a database consisting of coiled coils from 
another family of proteins. For example, we are able to learn coiled coils in intermediate 
filaments from a database of coiled coils in either myosins or tropomyosins. Our techniques
improve on non-learning based approaches, such as the PAIRCOIL program [14], which fail to
identify conjectured coiled coil residue positions.

We tested LEARNCOIL on three different domains (Table 4.3): tropomyosins (TROPs) as a
base set and myosins (MYOs) and intermediate filaments (IFs) as an evaluation set; myosins
as a base set and tropomyosins and IFs as an evaluation set; IFs as a base set and myosins
and tropomyosins as an evaluation set. A different set of iteration test sequences was used for
each of these tests; that is, the set that includes sequences of the two protein families in the
evaluation set. For these experiments, we have residue data, and thus our performance measure
is with respect to residues. False negatives are residues of sequences in the evaluation set which
do not have at least a 50% likelihood. False positives are defined as in section 4.4.2.


Base Set | Evaluation Set | Performance without LEARNCOIL        | Performance with LEARNCOIL
         |                | % of residues | # of false positive seqs | % of residues | # of false positive seqs
TROPs    | MYOs + IFs     | 70.9%         | 4/286                    | 99%           | 1/286
MYOs     | TROPs + IFs    | 88.8%         | 2/286                    | 99%           | 1/286
IFs      | MYOs + TROPs   | 83.3%         | 4/286                    | 99.4%         | 2/286


Table 4.3: Learning 2-stranded coiled coils from a restricted set

Here the weight of the original database was empirically chosen to be A = 0.3. One possible
explanation is that, since the subclasses of 2-stranded coiled coils have more similarities than
differences, the program does not have to be so aggressive in picking up the evaluation set.
Moreover, the goal is a target set of 2-stranded coiled coils, and this is best achieved by weighting
each of the 3 types of proteins equally. We also experimented with weights of A = 0.1 and
A = 0.5, and while their overall performance was similar, they produced more false positives.

First, we consider experiments with tropomyosins in the base set and myosins and IFs in 
the evaluation set. LEARNCOIL positively identifies 99% of the myosin and IF residues in 
the 2-stranded database and makes one false positive prediction. This is in contrast to
PAIRCOIL, which obtained a performance of 70.9%, with four false positive and two false
negative predictions.

Next we consider experiments with a base set of myosins and an evaluation set of
tropomyosins and IFs. LEARNCOIL positively identifies 99% of the tropomyosin and IF residues
and makes one false positive prediction. This is in contrast to PAIRCOIL, which obtained a
performance of 88.8%, with two false positive and one false negative predictions.

Lastly, we consider experiments with a base set of IFs and an evaluation set of tropomyosins
and myosins. LEARNCOIL positively identifies 99.4% of the tropomyosin and myosin residues and
makes two false positive predictions. One possible explanation for more false positives here is
that the IFs have a less obvious coiled-coil structure and there very well may be non-coiled coil
residues in the database; consequently, starting with a table of solely IFs may select out
non-coiled coils for the target database. In contrast, PAIRCOIL obtained a performance of 83.3%,
with four false positive predictions.

For all three experiments above, LEARNCOIL improved on the performance of PAIRCOIL in
identifying coiled coil residues, while also making fewer false positive predictions.

We also tested LEARNCOIL with the NEWCOILS program [58] used as the underlying scoring 
algorithm. For subclasses of 2-stranded coiled coils, we found that LEARNCOIL enhanced the 
performance of NEWCOILS as well. It obtained a performance of 96.2% when tropomyosins
were used as the base set, a performance of 95.3% when myosins were used, and a performance
of 98.2% when IFs were used. The program did not make any false positive predictions when
run on these three test domains. In contrast, the non-learning based version of NEWCOILS had
substantial overlap between the residue scores for coiled coils and non-coiled coils in all of the
three test domains.


4.4.4 New coiled-coil-like candidates


The LEARNCOIL program has identified many new sequences that we believe contain
coiled-coil-like structures. Table 4.4 lists some examples of “newly found” viral proteins (i.e.,
proteins for which PAIRCOIL indicates that no coiled coil is present, but LEARNCOIL indicates a
coiled-coil-like structure is present). We believe that the proteins given in Table 4.4 either contain
coiled coils or coiled-coil-like structures. For example, recent biological work has identified a
coiled-coil-like structure which is believed to consist of a parallel, trimeric coiled coil encircled 
by three helices packed in an antiparallel formation; this structure is thought to be in the 
envelope glycoproteins of both HIV and SIV (Simian Immunodeficiency Virus) [19, 56]. 

Our program seems to be able to accurately predict this new coiled-coil-like structure. For 
example, it identifies two coiled-coil-like regions in the envelope protein of SIV. Independently, 
the biological investigation of SIV by Blacklow et al. predicts that these are the two regions 
that are part of the coiled-coil-like structure [19]. One of these regions (comprising the outer
three helices) is predicted by the NEWCOILS program and is given a 26% likelihood by the
PAIRCOIL program. The other region (comprising the trimeric coiled coil) is only predicted by
our LEARNCOIL program. This region corresponds to the N-terminal fragment in the paper
of Blacklow et al. In fact, the region LEARNCOIL predicts and the region that Blacklow et al.
find are almost identical: LEARNCOIL predicts a coiled-coil-like structure starting at residue
553 and ending at residue 601, whereas Blacklow et al. start the region at residue 552 and end
it at residue 604.


PIR Name                                              | LEARNCOIL Likelihood | PAIRCOIL Likelihood
mouse hepatitis virus E2 glycoprotein precursor       |                      |
human rotavirus A glycoprotein NCVP5                  |                      |
human respiratory syncytial virus fusion glycoprotein |                      |
human T-cell surface glycoprotein CD4 precursor       |                      |
human T-cell lymphotropic virus - type I, env         |                      |
equine infectious anemia virus, env                   |                      |
fruit fly 14-3-3 protein                              |                      |
HIV, env                                              |                      |
SIV, env                                              |                      |


Table 4.4: Newly discovered coiled-coil-like candidates 


Moreover, there is biological evidence that several other of the sequences in Table 4.4 contain 
coiled-coil-like structures. Our predictions were made independently of these results. Recently, 
the crystal structures of two 14-3-3 proteins have been solved [55, 75]. The paper of Liu et al.
studies the zeta isoform of the 14-3-3 structure in E. coli, and they report a 2-stranded
antiparallel coiled coil structure. On the other hand, the paper of Xiao et al. studies the human
T-cell τ dimer, and they report helical bundles. Although there is some uncertainty here, it
is likely that the 14-3-3 protein we have identified contains a coiled-coil-like structure, if not a
coiled coil itself. The human T-cell lymphotropic virus and equine infectious anemia virus are
closely related to HIV, and thus their envelope proteins are also likely to contain coiled-coil-like 
structures. 

The likelihoods reported in Table 4.4 are compared against those of the PAIRCOIL program.
The NEWCOILS program of Lupas et al. finds some of these proteins; however, in general, this
program finds a significant number of false positives. The 14-3-3 protein, the human T-cell
lymphotropic virus envelope protein and the human T-cell surface glycoprotein CD4 precursor
are found only using our LEARNCOIL program. As mentioned above, there is some biological
evidence that at least two of these proteins (the 14-3-3 protein and human T-cell lymphotropic
virus envelope protein) contain coiled-coil-like structures.

We anticipate that the identification of likely coiled-coil-like regions in important protein 
sequences (such as those in Table 4.4) will facilitate and expedite the study of protein structure 
by biologists. In addition, since our program LEARNCOIL is able to identify the new
coiled-coil-like motif in HIV and SIV, it is possible that our program will aid in the discovery of
this structure in other retroviruses.


4.5 Conclusions 


In this chapter, we have shown that a learning-based algorithm that uses randomness and 
statistical techniques can substantially enhance existing methods for protein motif recognition. 
We have designed a program LEARNCOIL and demonstrated its ability to “learn” the 2-stranded
and 3-stranded coiled coil motifs. It has identified new sequences that we believe contain
coiled-coil-like structures. It is our hope that biologists will use this program to help identify
other new coiled-coil-like structures.

There is evidence that our program may have identified a new coiled-coil-like motif that 
occurs in retroviruses, and future work involves studying retroviruses and this motif more 
closely. 

In the future we plan to apply the LEARNCOIL program to motifs other than those that 
have coiled-coil-like properties. Limited data is a problem for many protein structure prediction 
problems. There are newly discovered protein motifs that biologists cannot yet predict and,
more importantly, whose characterizing structural features are not yet even known. We
hope to extend the techniques developed here to aid in the determination of crucial structural
features that give rise to these motifs, as well as to learn how to predict which proteins exhibit
these motifs.


CHAPTER 5 


Concluding remarks 


In this thesis, we have studied three problems in machine learning. In the first part of the thesis, 
we examined Valiant’s PAC model, and considered learnability in this model. In particular, we 
studied concept classes of functions on k terms, and gave an algorithm for learning any function
on k terms by general DNF. On the other hand, we showed that if the learner is restricted so
that it must output a hypothesis which is a member of the concept class being learned, then
learning the concept class of any symmetric function on k terms is NP-hard (except for the
concept classes of AND, NOT AND, TRUE and FALSE). Our results completely characterize
the learnability of concept classes of symmetric functions on k terms. We leave as an open
problem whether concept classes for more general functions on k terms can be learned when
the learner’s output hypothesis is restricted. 

The second part of the thesis introduced the problem of piecemeal learning an unknown 
environment. For environments that can be modeled as grid graphs with rectangular obstacles, 
we gave two piecemeal learning algorithms in which the robot traverses a linear number of 
edges. For more general environments that can be modeled as arbitrary undirected graphs, we 
gave a nearly linear algorithm. An interesting open problem is whether there exists a linear 
algorithm for piecemeal learning arbitrary undirected graphs. Piecemeal learning takes into 
account just one of the limitations on a robot’s resources. It would be interesting to come up
with models and algorithms to handle other practical limitations of a robot, such as incorrect
data that a robot may receive (due to noisy sensors) and difficulties a robot may have in motor
control. Other extensions of the work might include the scenario of multiple robots, or multiple
“refueling stations.”

In the last part of the thesis, we applied machine learning techniques to the problem of 
protein folding prediction. We gave an iterative learning algorithm that is particularly effective 
for folds for which there is not much currently available data. We implemented our algorithm, 
and showed its effectiveness on the 3-stranded coiled coil motif. There are other motifs for
which there is a lack of data, such as β-rolls and β-helices, and it would be interesting to extend
our techniques to work on these motifs. In addition, there is evidence that our program may
have identified a new coiled-coil-like motif that occurs in retroviruses, and future work involves
studying this motif more closely.